
CS157B - Fall 2004
Entity-Relationship Data Model
Entity Sets
ƒ a collection of similar entities
Attributes
ƒ properties of the entities in an entity set
Relationships
ƒ connection among two or more entity sets
Constraints
ƒ Keys : uniquely identify an entity within its
entity set
ƒ Single-value constraints : requirement that the
value in a certain context be unique
ƒ Referential integrity constraints : existence test
ƒ Domain constraints : range test
ƒ General constraints : data set constraints
E/R - Cont.
Constraints Are Part of the Schema
ƒ Should be part of the DB design
ƒ Must be carefully constructed
ƒ Can be troublesome if major redesign is
required
Subclasses
ƒ Special "isa" relationship
ƒ Similar to OO's definition
Weak Entity Sets
ƒ Some of its key may be composed of attributes
which belong to another entity set
ƒ Need additional qualifier
The Relational Data Model
Relation
ƒ Represented by a two-dimensional table
Attribute
ƒ Represented as columns in a table
Schema
ƒ The name of a relation and the set of attributes
of this relation
ƒ A database schema consists of one or more
relation schemas
Tuple
ƒ A row in the table
Domain
ƒ Each attribute of a relation has an elementary
type; thus each component of a tuple must be atomic
Relational - Cont
Converting Subclass Structures to Relations
ƒ E/R-Style Conversion
ƒ Object-Oriented Approach
ƒ Null Values
Functional Dependencies (FD)
ƒ A unique-value constraint in a relational
schema design
ƒ Keys of Relations
–{A(1), ..., A(n)} is a key of relation R if
no two distinct tuples agree on all of A(1), ..., A(n)
no proper subset of {A(1), ..., A(n)} determines all
other attributes of R, i.e. a key must be minimal
–If more than one key, then designate one of
the keys as the primary key
ƒ Superkey : a set of attributes that contains a key
Relational - Cont
Design of Relational Database Schemas
ƒ Anomalies
–Redundancy : The same information is repeated
unnecessarily in several tuples
–Update Anomalies : Updating the information in
one tuple may leave it inconsistent in other tuples
–Deletion Anomalies : Lose other information
as a side effect when a set of values becomes
empty
ƒ Decomposing Relations
–Splitting the attributes of R to make the
schemas of two new relations
ƒ Boyce-Codd Normal Form (BCNF)
–The left side of every nontrivial FD must be a
superkey
–Recursively decompose into smaller sets of
tables until they all satisfy the BCNF rule
–Must be able to join the decomposed tables back
into the original relation (a lossless join); see the
sketch below
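A minimal sketch of the recursive BCNF decomposition just outlined, in Python (the relation is a set of attribute names, FDs are (lhs, rhs) pairs of frozensets; projecting the FDs onto the decomposed relations is omitted for brevity):
  # Compute the closure of a set of attributes under the given FDs.
  def closure(attrs, fds):
      result = set(attrs)
      changed = True
      while changed:
          changed = False
          for lhs, rhs in fds:
              if lhs <= result and not rhs <= result:
                  result |= rhs
                  changed = True
      return result

  # Split rel until the left side of every nontrivial FD is a superkey.
  def bcnf_decompose(rel, fds):
      for lhs, rhs in fds:
          if lhs <= rel and (rhs & rel) - lhs:       # FD applies and is nontrivial here
              if not rel <= closure(lhs, fds):       # lhs is not a superkey: violation
                  r1 = closure(lhs, fds) & rel       # lhs plus everything it determines
                  r2 = lhs | (rel - r1)              # lhs plus the remaining attributes
                  return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)
      return [rel]

  # Example: Movies(title, year, studioName, studioAddr) with studioName -> studioAddr.
  R = {"title", "year", "studioName", "studioAddr"}
  fds = [(frozenset({"studioName"}), frozenset({"studioAddr"}))]
  print(bcnf_decompose(R, fds))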
Relational - Cont
ƒ Third Normal Form (3NF)
–A relation R is in 3NF if: whenever A(1)...A(n)
->B is a nontrivial FD, either {A(1)...A(n)} is a
superkey, or B is a member of some key
–May allow minimal redundancy in the end
ƒ Multivalued Dependencies (MVD)
–A statement that two sets of attributes in a
relation have sets of values that appear in all
possible combinations
–Example of MVD in p.118 figure 3.29
–No BCNF violation
ƒ Fourth Normal Form (4NF)
–A relation R is in 4NF if whenever
A(1)...A(n)->->B(1)...B(m) is a nontrivial MVD,
{A(1)...A(n)} is a superkey
–Can be decomposed similar to the BCNF
decomposition algorithm
Other Data Models
Review of Object-Oriented Concepts
ƒ Type System
–Record structures
–Collection types
–Reference types
ƒ Classes and Objects
–Contain attributes and methods
ƒ Object Identity
–OID
ƒ Methods
–Functions within a class
ƒ Hierarchies
–Superclass and subclass relationship
Introduction to ODL
ƒ Object-Oriented Design
–Three main properties: attributes,
relationships, and methods
Other Data Models - Cont
ƒ Class Declarations
–e.g. class <name> { <list of properties> }
ƒ Attributes in ODL
–can be simple or complex type
ƒ Relationships in ODL
–e.g. in Movie: relationship Set<Star> stars;
ƒ Inverse Relationships
–e.g. in Star: relationship Set<Movie>
starredIn;
ƒ Methods in ODL
–signatures
ƒ Types in ODL
–Atomic types:
integer
float
character
character string
boolean
enumerations
Other Data Models - Cont
–Structure types
 e.g. class name
 Structures - similar to a tuple of values
–Collection types
Set - distinct unordered elements
Bag - unordered elements
List - ordered elements
Array - e.g. Array<char,10>
Dictionary - e.g. Dictionary(Keytype, Rangetype)
Additional ODL Concepts
ƒ Multiway Relationships in ODL
–ODL supports only binary relationships
–Use several binary, many-one relationships
instead
ƒ Subclasses in ODL
–Inherits all the properties of its superclass
Other Data Models - Cont
ƒ Multiple Inheritance in ODL
–Needs to resolve name conflicts from
multiple superclasses
ƒ Keys in ODL
–Optional because of the existence of an OID
–Can be declared with one or more attributes
ODL Designs to Relational Designs
ƒ Many issues
–Possibly no key in ODL
–Relational can't handle structure and
collection types directly
–Converting some ODL type constructors can
lead to a BCNF violation
ƒ Representing ODL Relationships
–one relation for each pair of inverse
relationships
Other Data Models - Cont
Object-Relational Model
ƒ O/R Features
–Structured types for attributes
–Methods
–Identifiers for tuples (like OID)
–References
ƒ Compromise Between Pure OO and Relational
Semistructured Data
ƒ "Schemaless" - the schema is attached to the
data itself
ƒ Represented in a collection of nodes
ƒ Interior node describes the data with a label
ƒ Leaf node contains the data of any atomic type
ƒ Legacy-database problem
Relational Algebra
Algebra of Relational Operations
ƒ Bags rather than sets can be more efficient
for some operations, such as a union
of two relations that contain duplicates
ƒ Components
–Variables that stand for relations and
Constants which are finite relations
–Expressions of relational algebra are referred
to as queries
ƒ Set Operations on Relations
–R ∪ S, the union of R and S (distinct)
–R ∩ S, the intersection of R and S (distinct)
–R - S, the difference of R and S, is the set of
elements that are in R but not in S; different
from S - R
–The two relations must have their attributes in
the same order and of the same types
Relational Algebra - Cont
ƒ Projection
–The projection operator is used to produce
from a relation R a new relation that has only
some of R's columns
–e.g. Views
ƒ Selection
–The selection operator produces a new
relation with a subset of R's tuples; the
tuples in the resulting relation are those that
satisfy some condition C that involves the
attributes of R
–e.g. SQL select statement
ƒ Cartesian Product
–A cross-product of two sets R and S
–e.g. R x S ; if R has 2 tuples and S has 3
tuples, the result has 6 tuples
ƒ Natural Joins
–Use matching attributes
Relational Algebra - Cont
ƒ Theta-Joins
–Join two relations with a condition denoted
by 'C'
ƒ Combining Operations to Form Queries
–Multiple queries can be combined into a
complex query
–e.g. AND, OR, (), ...
ƒ Renaming
–Control the names of the attributes used for
relations that are constructed by applying
relational-algebra operations
ƒ Dependent and Independent Operations
–Some relational expression can be
"rewritten" in a different expression
–e.g. R ∩ S = R - (R - S) (demonstrated in the sketch below)
–e.g. Glass half-empty or half-full?
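A small illustration of the rewrite R ∩ S = R - (R - S), using Python sets of tuples as relations (the values are made up):
  R = {(1, "a"), (2, "b"), (3, "c")}
  S = {(2, "b"), (3, "c"), (4, "d")}

  intersection = R & S
  rewritten = R - (R - S)           # R minus (the tuples of R that are not in S)

  assert intersection == rewritten  # both yield {(2, "b"), (3, "c")}
  print(rewritten)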
Relational Algebra - Cont
Extended Operators of Relational Algebra
ƒ Duplicate Elimination
–This operator converts a bag to a set
–e.g. SQL keyword DISTINCT
ƒ Aggregation Operators
–SUM(), AVG(), MIN(), MAX(), COUNT()
ƒ Grouping
–This operator allows us to group a relation
and/or aggregate some columns
ƒ Extended Projection
–This allows expressions as part of an
attribute list
ƒ Sorting Operator
–This operator is anomalous, in that it is the
only operator in our relational algebra whose
result is a list of tuples, rather than a set
ƒ Outerjoin
–Takes care of dangling tuples by padding them
with a special "null" symbol
SQL
Simple Queries in SQL
ƒ select A(...) from R where C
ƒ Projection & Selection in SQL
–select title, length from Movie where
studioName = 'Disney' and year = 1990
ƒ String Comparison
–Bit Strings (bit data)
e.g. B'011' or X'7ff'
–Lexicographic order
e.g. 'A' < 'B'; 'a' < 'b'; 'a' > 'B'?
Depends on encoding scheme or standard
(e.g. ASCII, UTF8)
–LIKE keyword
"s like p" denotes where s is a string and p is a
pattern
Special character %
ƒ Dates and Times
–DATE '2002-02-04'
–TIME '19:00:30.5'
SQL - Cont
ƒ Null Values
–3 cases: unknown, inapplicable, withheld
–NOT a constant
–Test expression for IS NULL
ƒ Truth-Value "unknown"
–Pitfalls regarding nulls
–e.g. select * from Movie
where length <= 120 or length > 120;
ƒ Ordering the Output
– ORDER BY <list of attributes>
– Can be ASC or DESC, default is ASC
 Queries with > 1 Relation
ƒ e.g. select name from Movie, MovieExec
where title = 'Star Wars' and producerC# =
cert#
ƒ Disambiguating Attributes
–select MovieStar.name, MovieExec.name
from MovieStar, MovieExec where
SQL - Cont
ƒ UNION, INTERSECT, EXCEPT Keywords
–Same logic as the set operators ∪, ∩, and -
Subqueries
ƒ Can return a single constant or relations in the
WHERE clause
ƒ Can have relations appear in the FROM
clauses
ƒ Scalar Value
–An atomic value that can appear as one
component of a tuple (e.g. constant, attribute)
ƒ e.g. select name from MovieExec where
cert#=(select producerC# from Movie where
title='Star Wars');
ƒ Conditions Involving Relations
–If R is a relation, then EXISTS R is a
condition that is true if R is not empty
SQL - Cont
–s IN R is true if s is equal to one of the
values in R; s NOT IN R is true if s is equal to
no value in R
–s > ALL R is true if s is greater than every
value in unary relation R
–s > ANY R is true if s is greater than at least
one value in unary relation R
–EXISTS, ALL, ANY operators can be
negated by putting NOT in front of the entire
expression
ƒ Subqueries in FROM Clauses
–Can substitute a R in the FROM clause with
a subquery
–e.g. select name from MovieExec, (select
producerC# from Movie, StarsIn where title =
movieTitle and year = movieYear and
starname = 'Harrison Ford') Prod
where cert# = Prod.producerC#;
SQL - Cont
ƒ Cross Joins
–Known as Cartesian product or just product
–e.g. Movie CROSS JOIN StarsIn;
ƒ Natural Joins
–The join condition is that all pairs of attributes
from the two relations having a common
name are equated, and no other conditions
–One of each pair of equated attributes is
projected out
ƒ Outerjoins
–e.g. MovieStar NATURAL FULL OUTER
JOIN MovieExec;
Full-Relation Operations
ƒ Eliminating Duplicates
–Use the DISTINCT keyword in SELECT
–Performance consideration
SQL - Cont
ƒ Duplicates in UNION, INTERSECT, EXCEPT
–By default, these operations convert bags to
sets (duplicates are eliminated)
–Use the keyword ALL after UNION, INTERSECT,
or EXCEPT to prevent the elimination
of duplicates
ƒ Grouping and Aggregation in SQL
–Use the special GROUP BY clause
ƒ Aggregation Operators
–SUM, AVG, MIN, MAX are used by applying
them to a scalar-valued expression, typically a
column name, in a SELECT clause
–COUNT(*) counts all the tuples in the relation
that is constructed from the FROM clause
and WHERE clause of the query
ƒ HAVING Clause
–Use in conjunction with GROUP BY to
narrow the aggregated list
SQL - Cont
Database Modifications (DML)
ƒ Insertion
–Insert into R(A1...An) values (V1,...Vn);
–e.g. insert into Studio(name) values('S1');
–Can insert multiple tuples with subquery
–e.g. insert into Studio(name) select distinct
studioName from Movie where studioName
not in (select name from studio);
ƒ Deletion
–Delete from R where <condition>;
–Can delete multiple tuples with 1 delete
statement depending on <condition>
ƒ Updates
–Update R set <new-value assignment>
where <condition>;
–Can update multiple tuples with 1 update
statement depending on <condition>
SQL - Cont
DDL in SQL
ƒ Data Types
–Character strings, fixed or variable length
–Bit strings, fixed or variable length
–Boolean: true, false, unknown
–Integer or int; shortint
–Floating-point numbers
e.g. decimal(n,d) where n is the total number of
digits and d is the number of digits to the right of
the decimal point;
1234.56 can be described as decimal(6,2)
–Dates and times can be represented by the
data types DATE and TIME respectively
ƒ Table Declarations
– Use the keywords CREATE TABLE followed
by the relation name and a list of attributes and
their types
–e.g. create table MovieStar(name char(30),
address varchar(256), gender char(1),
birthday DATE);
SQL - Cont
ƒ Modifying Relation Schemas
–Drop table MovieStar;
–Alter table MovieStar add phone char(16);
–Alter table MovieStar drop birthdate;
ƒ Default Values
–Use the DEFAULT keyword to set default
values for a column
–e.g. alter table MovieStar add phone
char(16) default 'unlisted';
ƒ Indexes
–Allow faster access to data
–e.g. create index YearIndex on Movie(year);
–Can be one or more attributes
–e.g. create index KeyIndex on
Movie(title,year);
–Delete the index using drop index statement
ƒ Selection of Indexes
–Selection vs IUD performance
SQL - Cont
View Definitions
ƒ A view does not contain any physical data
–"virtual relation"
ƒ Declaring Views
–Create view <view-name> as <view-definition>
ƒ Querying Views
–Use the normal select syntax with a view
name in place of the table name
ƒ Renaming Attributes
–Can map table attribute name from the base
table to a new name in a view definition
ƒ Modifying Views
–Updatable views are useful in special cases;
selective IUDs
Constraints & Triggers
Keys & Foreign Keys
ƒ Primary Keys
–Each relation can have only one primary key
–Primary key attribute(s) can not be NULL
–Two ways to specify the primary key
1) create table MovieStar(
name char(30) primary key,
address varchar(255),
gender char(1),
birthdate date);
2) create table MovieStar(
name char(30),
address varchar(255),
gender char(1),
birthdate date,
primary key(name, birthdate));
ƒ Unique Keys
–Each relation can have >1 unique keys
–Declared the same way as primary key
–NULL is allowed
Constraints & Triggers - Cont
ƒ Enforcing Key Constraints
–During insertion or update to the relation
ƒ Foreign-Key
–The "referenced" attribute(s) must be
declared unique or the primary key for their
relation; it must not have a NULL value
–create table Studio(
name char(30) primary key,
address varchar(255),
presC# int references MovieExec(cert#)
);
–create table Studio(
name char(30) primary key,
address varchar(255),
presC# int,
foreign key (presC#) references
MovieExec(cert#)
);
Constraints & Triggers - Cont
ƒ Maintaining Referential Integrity
–Reject Violating Modifications (Default)
Inserting or updating a Studio tuple whose presC#
value is not NULL and is not the cert# component of
any MovieExec tuple
Deleting a MovieExec tuple whose cert# component
appears as the presC# component of one or more
Studio tuples
Updating a MovieExec tuple's cert# value when the
old cert# is the value of presC# of some movie
studio in Studio
–Cascade Policy
Deleting the MovieExec tuple for the
president of a studio also deletes the
referencing tuple(s) from Studio
Changing the cert# value of a MovieExec tuple
from c1 to c2, when some Studio tuple has
c1 as the value of its presC# component, also
updates that presC# component to the
value c2
Constraints & Triggers - Cont
–Set-Null Policy
Can handle the delete and update problem by
setting the presC# to NULL
e.g. create table Studio (
name char(30) primary key,
address varchar(255),
presC# int references MovieExec(cert#)
on delete set null
on update cascade
);
–Deferring Checking of Constraints
Do selective insert to default the presC# to null
Insert tuple into MovieExec with new cert#
Update the Studio tuple with matching presC#
Use the keywords DEFERRABLE and DEFERRED to
delay the checking until the whole transaction is
"committed"
Reverse the DEFERRED case with keyword
IMMEDIATE
Constraints & Triggers - Cont
Constraints on Attributes and Tuples
ƒ Not-Null Constraints
–Use the NOT NULL keywords in create table
statement for any attribute
ƒ Attribute-Based Constraints
–Use the CHECK keyword in create table
statement
–Limit the value for an attribute
–e.g. gender char(1) check (gender in ('F','M'))
ƒ Tuple-Based Constraints
–Use the CHECK keyword in create table
statement
–Can be composed of a complex expression
involving multiple attributes
Constraints Modification
ƒ Naming Constraints
–In order to change or drop a constraint, it must have a name
–Use the CONSTRAINT keyword
Constraints & Triggers - Cont
ƒ Altering Constraints on Tables
–Can use ALTER TABLE to add or drop a
constraint
–Can use SET CONSTRAINTS to set it for
deferred or immediate
Schema-Level Constraints and Triggers
ƒ Assertions (General Constraint)
–A boolean-valued SQL expression that must
be true at all times
–create assertion <name> check
(<condition>)
–e.g.
create assertion RichPres check (not exists
(select * from Studio, MovieExec where
presC# = cert# AND netWorth < 10000000));
ƒ Event-Condition-Action Rules (ECA Rules)
–Triggers are awakened by certain events
–The "action" will be preform only if C = true
Constraints & Triggers - Cont
ƒ Triggers in SQL
–create trigger NetWorthTrigger
after update of netWorth ON MovieExec
referencing
old row as OTuple,
new row as NTuple
for each row
when (OTuple.netWorth > NTuple.netWorth)
update MovieExec
set netWorth = OTuple.netWorth
where cert# = NTuple.cert#;
–Default is "for each statement"
–Besides update, can use insert and delete
–Action can be "before" or "after" the event
–Use BEGIN...END for multiple statements
ƒ Instead-Of Triggers
–Not part of SQL-99
–Replace event with new defined operations
–Very powerful when used on a view
System Aspects of SQL
SQL Programming Environment
ƒ Host language + Embedded SQL
v
Preprocessor
v
Host language + Function calls
v
Host-language compiler <= SQL Library
v
Object-code program
ƒ Impedance Mismatch Problem
–Different data models between SQL
statements and programming languages
ƒ SQL/Host Language Interface
–Use EXEC SQL keywords in front of an SQL
statement
–Use shared (host) variables for SQL stmt
System Aspects of SQL - Cont
ƒ The DECLARE Section and Its Usage
–Shared variables are declared between two
embedded SQL statements.
–e.g.
EXEC SQL BEGIN DECLARE SECTION;
char studioName[50], studioAddr[256];
char SQLSTATE[6];
EXEC SQL END DECLARE SECTION;
–A shared variable can be used within an
SQL statement by placing a colon in front of it.
–e.g.
EXEC SQL INSERT INTO
Studio(name, address)
VALUES (:studioName, :studioAddr);
ƒ Single-Row Select Statement
–e.g.
EXEC SQL SELECT netWorth
INTO :presNetWorth
FROM Studio, MovieExec
System Aspects of SQL - Cont
ƒ Cursors
–Allow programs to "fetch" multiple rows from
a relation
–Here are the steps for using a cursor
EXEC SQL DECLARE <cursor> CURSOR FOR
<query>
EXEC SQL OPEN <cursor>
EXEC SQL FETCH FROM <cursor> INTO <list-of-variables>
If SQLSTATE is "02000", then goto close
<cursor>; otherwise fetch next row
EXEC SQL CLOSE <cursor>
–Row Modification with Cursor
Use the WHERE CURRENT OF keywords
e.g.
EXEC SQL DELETE FROM MovieExec
WHERE CURRENT OF execCursor;
EXEC SQL UPDATE MovieExec
SET netWorth = 2 * netWorth
WHERE CURRENT OF execCursor;
System Aspects of SQL - Cont
–Concurrent Update of Tuple
Use keywords INSENSITIVE CURSOR to ignore
new changes which may affect the current cursor
Use Keywords FOR READ ONLY to signal that
this cursor does not allow any modification
–Scrollable Cursors
Allow a set of movements within a cursor
ƒ Dynamic SQL
–Flexibility to enter SQL statement at run time
–Use EXEC SQL EXECUTE IMMEDIATE or
( EXEC SQL PREPARE ... and
EXEC SQL EXECUTE ... )
–e.g.
EXEC SQL BEGIN DECLARE SECTION;
char *query;
EXEC SQL END DECLARE SECTION;
/* Allocate memory pointed to by query
and fill in the SQL statement */
EXEC SQL EXECUTE IMMEDIATE :query;
System Aspects of SQL - Cont
Procedures Stored in the Schema
ƒ Persistent Stored Modules (PSM)
–Can build module to handle complex
computations which cannot be expressed
using SQL
ƒ PSM Functions & Procedures
–CREATE PROCEDURE <name> (<param>)
local declarations
procedure body;
–Procedure parameter can be input-only,
output-only, or both
–CREATE FUNCTION <name> (<param>)
RETURNS <type>
local declarations
function body;
–Function parameter can only be input as
PSM forbids side-effects in functions
System Aspects of SQL - Cont
ƒ Statements in PSM
–Call statement
CALL <proc name> (<arg list>);
e.g. EXEC SQL CALL Foo(:x, 3);
–RETURN <expression>;
–DECLARE <name> <type>;
–SET <variable> = <expression>;
–BEGIN ... END
–IF <condition> THEN
<statement list>
ELSEIF <condition> THEN
<statement list>
ELSEIF
...
ELSE <statement list>
END IF;
–SELECT <attr> INTO <var> FROM <table>
WHERE <condition>
–LOOP <statement list> END LOOP;
System Aspects of SQL - Cont
–FOR <loop name> AS <cursor name>
CURSOR FOR
<query>
DO
<statement list>
END FOR;
–Support WHILE and REPEAT loops
ƒ Exception Handler in PSM
–DECLARE <where to go> HANDLER FOR
<condition list> <statement>
–<where to go> can be:
CONTINUE - execute the handler statement and
then execute the next statement after the one
which caused the exception
EXIT - execute the handler statement and then
control leaves the BEGIN...END block in which the
handler is declared
UNDO - same as EXIT except that any changes to
the DB or local variables that were made by the
statements of the block are "undone"
System Aspects of SQL - Cont
SQL Environment
ƒ Schema
–A collection of tables, views, assertions,
triggers, PSM modules, etc
–CREATE SCHEMA <name> <declarations>
–Use SET SCHEMA to change schema name
ƒ Catalog
–A collection of schemas
–CREATE CATALOG <catalog name>
–Use SET CATALOG to change the current
catalog
ƒ Cluster
–A collection of catalogs
–Can be viewed as the set of all catalogs
accessible to a user
ƒ Client/Server
–The client and server can be on different
machines or on the same machine
System Aspects of SQL - Cont
ƒ Connection
–CONNECT TO <server name> AS
<connection name> AUTHORIZATION <name
and password>
–SET CONNECTION <name>;
–DISCONNECT <name>;
Call-Level Interface (CLI)
ƒ In C, each CLI program must include sqlcli.h,
which contains all the function, structure,
constant, and type definitions
ƒ 4 kinds of records: SQLHENV, SQLHDBC,
SQLHSTMT, and SQLHDESC.
ƒ Use SQLAllocHandle(hType, hIn, hOut)
ƒ Processing Statements
–Use SQLPrepare(sh, st, sl); &
SQLExecute(sh);
–or use SQLExecDirect(sh, st, sl);
–Use SQLFetch(sh) from a query result
System Aspects of SQL - Cont
–Use SQLBindCol(sh, colNo, colType, pVar,
varSize, varInfo) for column binding
–Can use SQLGetData(...) in place of
SQLBindCol(...) to extract data from a query
ƒ Passing Parameters to Query
–e.g.
SQLPrepare(myStmt, "INSERT INTO
Studio(name, address) VALUES (?, ?)",
SQL_NTS);
SQLBindParameter(myStmt, 1, ...,
studioName, ...);
SQLBindParameter(myStmt, 2, ...,
studioAddr, ...);
SQLExecute(myStmt);
Transactions in SQL
ƒ Serializability
–Multiple selects followed by multiple updates
to the same tuple; e.g. chooseSeat()
–Use locks to handle this problem
System Aspects of SQL - Cont
ƒ Atomicity
–Single user transaction may have multiple
updates to different tables; e.g. transfer from
account A to account B
–Only "commit" after all the changes are
made
ƒ Transaction
–A collection of one or more operations on the
database that must be executed atomically
–Use START TRANSACTION to begin
–Use SQL COMMIT to commit
–Use SQL ROLLBACK to abort and undo all
changes made since the start of the transaction
–Can set the transaction to READ ONLY
ƒ Dirty Reads (Uncommitted Reads)
–Data read that were "dirty" or uncommitted
ƒ Isolation Levels
–Serializable, uncommitted read, committed
read, and repeatable-read
System Aspects of SQL - Cont
Security & User Authorization in SQL
ƒ Privileges
–SQL defines nine types of privileges:
SELECT, INSERT, DELETE, UPDATE,
REFERENCES, USAGE, TRIGGER,
EXECUTE, and UNDER
ƒ Authorization Checking
–First at connect time
–Second at statement time
–Additional checks with modules
ƒ Grant & Revoke
–GRANT <privilege list> ON <database
element> TO <user list>
–Allow other user to perform certain actions
–REVOKE <privilege list> ON <database
element> FROM <user list>
–Disallow a previously granted privilege
Data Storage
Megatron 2002 Database System
ƒ Store relation in ASCII text file
ƒ Store the schema also in ASCII file
ƒ Obvious problems:
–Tuple layout on disk is not flexible; any
small change may shuffle the whole file
–Searching is expensive; must read the
whole file
–Query-processing is by brute force; nested
loop to examine all possibilities
–No memory buffering, every query requires
direct access to disk
–No concurrency control
–No reliability; e.g. no crash recovery
The Memory Hierarchy
ƒ Cache Memory
–Fast access to and from processor or I/O
controller
Data Storage - Cont
ƒ Main Memory
–Random access (RAM)
–Both OS and applications reside in RAM
ƒ Virtual Memory
–Allows each application to have its own
private memory space, which is mapped to
physical memory (RAM) or to disk
–A page is the unit of memory transferred
between main memory and disk
ƒ Secondary Storage
–Much slower than main memory
–Two types of disk I/O
Disk read means moving a block from disk to
main memory
Disk write means moving a block from main
memory to disk
–Most DBMSs manage disk blocks themselves,
rather than relying on the OS file manager
Data Storage - Cont
ƒ Volatile and Nonvolatile Storage
–Main memory is typically volatile; thus
when the power is off, the content is gone
–Flash memory is nonvolatile, but it is very
expensive and currently not used as main
memory
–An alternative is to use a "RAM disk"
combined with a battery backup for the power
supply
Disks
ƒ Disk Components
–Head, platter (2 surfaces each), cylinder,
tracks, sectors, gap
ƒ Disk Controller
–Controls the movement of the disk head(s)
to a specific track and performs reads and
writes
–Transfers data to and from main memory
Data Storage - Cont
Effective Use of Secondary Storage
ƒ CS studies of algorithms often assume that
the data are always in main memory; this is
not a valid assumption for DBMS
ƒ I/O Model of Computation
–Dominance of I/O cost
If a block needs to be moved between disk and
main memory, then the time taken to perform
the read/write is much larger than the time for
manipulating that data in main memory; thus the
I/O time is a good approximation of the total
time
Similar to Big O notation for algorithm study
ƒ Sorting Data in Secondary Storage
–If we need to sort 1.64 billion bytes and a
disk block is configured to handle 16384
bytes, then 100000 blocks are required to
read each tuple once from disk
–Quicksort is one of the fastest algorithm
but its assumption is all entries are in
memory
Data Storage - Cont
ƒ Two-Phase, Multiway Merge-Sort (TPMMS)
–Consists of 2 phases
Phase 1: Sort main-memory-sized pieces of
the data, so every record is part of a sorted list
that just fits in the available main memory; the
results are a set of sorted sublists on disk
which we merge in the next phase
Phase 2: Merge all the sorted sublists into a
single sorted list
–Example:
If we have 100 MB of main memory using
16384 size block sorting 1.64 billion bytes, we
can fit 6400 blocks at a time in main memory;
thus the results from phase 1 will have 16
sorted sublists
If we merged only two sublists at a time, we would
need 4 more passes over the data, about 8 extra
disk I/O's per block
The better approach is to read the first block
of each sorted sublist into a main-memory
buffer, repeatedly move the smallest remaining
element to an output buffer, and flush/reload
buffers when necessary (see the sketch below).
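A minimal in-memory sketch of the two phases just described, in Python (lists stand in for disk blocks, and heapq.merge stands in for the one-block-per-sublist merge buffers):
  import heapq

  def tpmms(records, memory_size):
      # Phase 1: sort pieces of the data that fit in "memory"; each piece
      # becomes a sorted sublist (written to disk in the real algorithm).
      sublists = []
      for i in range(0, len(records), memory_size):
          sublists.append(sorted(records[i:i + memory_size]))
      # Phase 2: repeatedly pick the smallest front element among all sublists.
      return list(heapq.merge(*sublists))

  print(tpmms([7, 3, 9, 1, 8, 2, 6, 5, 4], memory_size=3))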
Data Storage - Cont
Accelerating Access to Secondary Storage
ƒ TPMMS example in 11.4.4 assumed that data
was stored on a single disk and the blocks
were chosen randomly
ƒ There is certainly room for improvement with
the following methods, each with its advantages
and disadvantages
–Cylinder-Based Organization
Can reduce disk block access time in phase one
of TPMMS by more than 95%
Excellent for applications that have only one
process accessing the disk and whose block reads are
grouped in logical sequences
Not useful when reading random blocks
–Multiple Disks
Can speed up both grouped and random accesses
Accesses to the same disk can't be done in parallel
Can be expensive since single large disk is
usually more cost effective than multiple smaller
disks with the same capacity
Data Storage - Cont
–Mirroring
Reduce access time for read/write requests
Built-in fault tolerance for all applications
Must pay for 2+ disks for the capacity of only 1
–Elevator Algorithm
Reduce read/write access time when the blocks
are random
The average delays for each request can be high
for any high-traffic system
–Prefetching/Double Buffering
Greatly improve access when the blocks are
known or grouped together.
Require extra main-memory buffers
No help when accesses are random
Disk Failures
ƒ Intermittent Failures
–An attempt to read or write a sector fails,
but succeeds after a number of retries
Data Storage - Cont
ƒ Checksum
–Widely used method to detect media errors
–Use a collection of bits to calculate a fixed
number; when the recalculation fails, a
media error is the likely cause
–Detect errors but does not fix them
ƒ Stable Storage
–Similar to the disk mirroring except that this
is achieved at the software/application level
–Keep an extra "delta" copy of the data to
prevent media error and possible data
corruption caused by power failure
Disk Crash Recovery
ƒ Redundant Arrays of Independent Disks-RAID
ƒ Redundancy Technique - Mirroring
–Known as RAID level 1
–When one of the disks fails, the other
"mirroring" disk becomes the main disk
Data Storage - Cont
ƒ Parity Blocks
–Known as RAID level 4
–Use only 1 redundant disk no matter how
many data disks it may support
–Utilizing the modulo-2 sum for parity checks
–Too many data disks can cause the redundant
disk to become a bottleneck, since a write to
any of the n data disks changes the check-sum
bits and requires a write to the redundant disk
(see the XOR sketch below)
ƒ RAID 5
–Improve the RAID 4 approach by sharing the
redundant disk workload into all n disks
ƒ Multiple Disk Crash
–RAID 6: use error-correcting codes such as
Hamming code
–Use a combination of data and redundant
disks; how many of each are required depends
on how many concurrent failures among the n
disks must be survived
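A small sketch of the modulo-2 (XOR) parity idea behind RAID 4, assuming each block is a Python bytes object of the same length:
  # The parity block is the bitwise XOR of the corresponding bytes of all data blocks.
  def parity(blocks):
      out = bytearray(len(blocks[0]))
      for block in blocks:
          for i, b in enumerate(block):
              out[i] ^= b
      return bytes(out)

  data = [b"\x0f\x33", b"\xf0\x55", b"\x3c\x0f"]
  p = parity(data)

  # If one data block is lost, XOR-ing the parity block with the surviving
  # data blocks recovers the missing block.
  assert parity([p, data[0], data[2]]) == data[1]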
Representing Data Elements
Data Elements and Fields
ƒ Relational Database Elements
–Since a relation is a set of tuples, and tuples
are similar to a record/structure in C or C++,
we may imagine that each tuple will be stored
on disk as a record.
ƒ Objects
–An object is like a tuple, with its instance
variables as attributes.
ƒ Data Elements
–INTEGER
Usually 2 or 4 bytes long
–FLOAT
Usually 4 or 8 bytes long
–CHAR(n)
Fixed length denoted by n
Representing Data Elements Cont.
–VARCHAR(n)
Variable length with n as the maximum
Two ways to represent varchar
Length plus content
Null-terminated string
–Dates and Times
Date is usually represented as char(n)
Time is represented as varchar(n) because of the
support for fractions of a second
–Bits
Can pack 8 bits into a byte; bit fields are rounded
up to the next byte boundary.
–Enumerated Types
 Using a byte to represent each item, thus can
have 256 different values
Representing Data Elements Cont.
Records
ƒ Building Fixed-Length Records
–Can concatenate the fields to form a record
–Be aware of the 4 and 8 bytes boundary
depending on the HW and OS; therefore must
organize data accordingly
ƒ Record Headers
–Also known as the record descriptor
–Information about the record such as length,
timestamp, record id, record type, etc.
ƒ Packing Fixed-Length Records into Blocks
–Using a block header followed by multiple
records
Representing Data Elements Cont.
Representing Block and Record Addresses
ƒ Client-Server Systems
–The server's data lives in a database
address space. The address space can refer
to blocks and possibly to offsets within the
block.
Physical Addresses
Byte strings that can determine the location
within secondary storage system where the
block or record can be found. Information such
as hostname, cylinder number, track number,
block number, and offset from the beginning of a
record within a block.
Logical Addresses
Can be viewed as a flat model where all the
records are in logical sequence in memory.
Use a mapping table to map logical to physical
addresses.
Representing Data Elements Cont.
ƒ Logical and Structured Addresses
–Why logical addressing
Movement of data can be done by changing the
logical to physical mapping table rather than
moving the actual data itself.
–Structured addressing
Using a key value and the physical address of a
block, a record can easily be located
Can be viewed as a form of "hashing"
Fast lookup if each record is fixed-length
– Offset table
Keeping an offset table as part of a block header
can handle variable length record with fast lookup.
Allows easy movement of data, one of the main
advantages of logical addressing.
Representing Data Elements Cont.
ƒ Pointer Swizzling
–It means to translate the embedded pointers
from secondary (database address) to main
memory (virtual address).
–A pointer usually consists of a bit indicating
whether the pointer is currently a database
address or a (swizzled) memory address; and
the actual database or memory pointer.
–Automatic swizzling means that when we load a
block, all its pointers and addresses are put
into the translation table if not already there.
–Another approach is to translate a pointer
only when it is actually used.
–Programmer can control pointer swizzling by
using a look-ahead logic (e.g. prefetch).
Representing Data Elements Cont.
ƒ Pinned Records and Blocks
–A block in memory is pinned if it cannot be
written back to disk safely.
–Can view this as a constraint or dependency
from other pointers in another block.
Variable-Length Data and Records
ƒ Variable-Length Fields Record
–Must keep the length and offsets of the
variable length fields.
–One simple but effective method is to put all
the fixed-length fields in the beginning and
then follow by the variable-length fields.
ƒ Repeating Fields Record
–Use the same method as above but can
move the repeating fixed-length fields to
another block.
Representing Data Elements Cont.
ƒ Variable-Format Records
–In certain situation, records may not have a
fixed schema. The fields or their order are not
completely determined by the relation.
–Use tagged fields to handle such cases.
We store the attribute or field name, the type of the
field, and the length of the field.
Very similar to the SQLDA definition and the bind-in
and bind-out operations of SQL
ƒ Spanned Records
–When a record can't fit into a block and must
be broken up into multiple blocks, it is called
spanned.
–Spanned records require extra header
information to keep track of their fragments.
Representing Data Elements Cont.
ƒ Binary Large Objects (BLOBS)
–Can hold large audio and image files
–Stored in a sequence of blocks but also can
be striped across multiple disk for faster
retrieval
–Retrieval is usually done in small chunks
Record Modifications
ƒ Insertion
–If order is not important, then just add it to
the end of the free space within a block.
–If order is important, then we must slide the
data to fit the new record. In this case, the
offset table implementation can help reduce
the actual movement of data.
–If the block is full, then either a) find space
on a "nearby" block and keep the forwarding
address or b) use a chain of overflow blocks.
Representing Data Elements Cont.
ƒ Deletion
–When a record is deleted, the system must
reclaim its space.
–If using an offset table, then it should shuffle
the free space to a central location
–An alternative is to keep track of a linked list
of free space (e.g. a freelist).
–When the record is spanned or flowed to a
different block (nearby or overflow), we may
need to do a "reorganizing" of all the blocks.
–To avoid dangling pointers, we can replace
some of the records with tombstones (dummy
record indicating a dead end).
ƒ Update
–If the record is fixed-length, then there is no
effect on the storage.
–Otherwise we have the same problems as
insertion or deletion of variable-length fields.
Index Structures
Indexes on Sequential Files
ƒ Usually have a data file and an index file. A
data file is a sorted file. An index file contains
only keys and pointers related to the data file.
ƒ Sequential Files
–A file which contains records that were
sorted by the keys defined by its index.
ƒ Dense Indexes
–Every key from the data file is represented in
the index.
–An index entry is small compared to a
record entry. Thus we may be able to keep
the index file content in memory, rather than
read from the index file.
–Since the keys are sorted, we can use binary
search to find the key (K). Search time is
about log n (base 2).
ƒ Sparse Indexes
–Hold only a subset of the dense indexes
–Use much less space but slower search
time.
Index Structures - Cont.
ƒ Multiple Levels of Index
–If an index itself cover many blocks, then a
binary search will still need to do many disk
I/O's to get to the correct record. We can
solve this problem by putting an index on the
index.
–The outer level of the index will be more
sparse compared to the inner level.
ƒ Indexes With Duplicate Search Keys
–If the index is not unique, we can have
multiple records with the same search key.
–An efficient approach is to have only one
record in the dense index for each search key
K. Then find the record within the sorted sublist.
ƒ Managing Indexes During IUDs
–In Chapter 12, recall that different methods
were discussed to handle IUDs of fixed or
variable length records, similar logic applies to
index files as well.
Index Structures - Cont.
ƒ Here is a list of actions on the sequential file
which affect the index file:
Action                            Dense     Sparse
--------------------------------------------------
Create empty overflow block       none      none
Delete empty overflow block       none      none
Create empty sequential block     none      insert
Delete empty sequential block     none      delete
Insert record                     insert    update(?)
Delete record                     delete    update(?)
Slide record                      update    update(?)
–Creating or deleting an empty overflow block has
no effect because the index holds no entries for
overflow blocks
Index Structures - Cont.
Secondary Index
ƒ This is a dense index, usually with duplicates.
ƒ Does not require the underlying file to be
sorted on the search key like in primary index.
ƒ The keys in the secondary index are sorted.
ƒ The pointers in one index block can go to many
different data blocks.
ƒ A clustered file structure can help manage the
many-one relationship, for example by storing
the tuples of related relations (such as a studio
and its movies) together in the same blocks.
ƒ Indirection in Secondary Indexes
–Using buckets in this index scheme to avoid
duplicates in the higher level.
–e.g. SELECT title FROM Movie WHERE
studioName = 'Disney' AND year = 1995;
–If studioName is the primary index and year
is the secondary index, then the number of
tuples which satisfy both conditions will be
reduced significantly.
Index Structures - Cont.
ƒ Document Retrieval and Inverted Indexes
–The WWW has brought many new
requirements for document retrieval and
pattern match. This results in newer and
better search engine as time passes.
e.g. Search all the documents which contain the
words "database", "programming", and
"implementation".
–A relational view of the Doc search
A document may be thought of as a tuple in a
relation. Each attribute/word can be represented as
a bit that is set to true if the Doc has at least one
match. Use a secondary index on each of the
attributes of Doc but only keep entries which have
the search-key value TRUE. Instead of creating a
separate index for each attribute, the indexes are
combined into one, called an inverted index.
Each inverted-index entry points us to a bucket of
pointers, where we can "join" (intersect) the lists of
pointers to satisfy all three words.
Index Structures - Cont.
B-Trees
ƒ B+ tree
–It automatically maintains as many levels of
index as is appropriate for the size of the file
being indexed.
–It manages the space on the blocks they use
so that every block is between half used and
completely full. No overflow blocks are
needed.
ƒ Structure
–It is balanced which means all paths from the
root to a leaf have the same length/distance.
–It contains the root, interior, and leaf nodes.
–Each node has up to n search keys and n+1
pointers.
–The keys in leaf nodes are copies of keys
from the data file. These keys are distributed
among the leaves in sorted order, from left to
right.
Index Structures - Cont.
–At a leaf, the last pointer points to the next
leaf block to the right. Unused pointers are
set to null.
–To deal with duplicate keys, the duplicate will
not be a key in the interior nodes.
ƒ Lookup
–Follow the keys until you reach a leaf node.
Then determine whether it exists or not.
ƒ Range Queries
–e.g. select * from R where R.k >=10 and R.k
<= 25.
–Same lookup steps, but continue to follow the
leaf's rightmost pointer until the key condition
no longer holds (see the sketch below).
ƒ Insertion and Deletion Into B-Trees
–May require splitting or merging the leaf and/or
other higher level nodes. Recursively done
from leaf back up to root.
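A minimal sketch of B+ tree lookup and range queries over a hand-built two-level tree (Python dicts stand in for index blocks; block sizes and splitting are not modeled):
  # Leaves hold sorted (key, record) pairs plus a pointer to the next leaf.
  leaf1 = {"keys": [5, 10], "recs": ["r5", "r10"], "next": None}
  leaf2 = {"keys": [20, 25], "recs": ["r20", "r25"], "next": None}
  leaf1["next"] = leaf2
  # In the root, child i holds keys < keys[i]; child i+1 holds keys >= keys[i].
  root = {"keys": [20], "children": [leaf1, leaf2]}

  def find_leaf(node, k):
      while "children" in node:                 # descend through interior nodes
          i = 0
          while i < len(node["keys"]) and k >= node["keys"][i]:
              i += 1
          node = node["children"][i]
      return node

  def lookup(root, k):
      leaf = find_leaf(root, k)
      return leaf["recs"][leaf["keys"].index(k)] if k in leaf["keys"] else None

  def range_query(root, lo, hi):
      # Find the leaf for lo, then follow next-leaf pointers while keys <= hi.
      leaf, out = find_leaf(root, lo), []
      while leaf is not None:
          for k, r in zip(leaf["keys"], leaf["recs"]):
              if lo <= k <= hi:
                  out.append(r)
              elif k > hi:
                  return out
          leaf = leaf["next"]
      return out

  print(lookup(root, 10), range_query(root, 10, 25))   # r10 ['r10', 'r20', 'r25']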
Index Structures - Cont.
ƒ Efficiency of B-Trees
–If each block has 255 pointers, then a 3 level
B-Tree can handle up to 16.6 million pointers
to records.
–Depending on the implementation, one may or
may not delete from a B-Tree for further
performance enhancement.
Hash Table
ƒ It contains a hash key and hash function which
determine the correct bucket the record
belongs to.
ƒ Hash function is very important for a balance
distribution.
ƒ Each bucket can be pointed to a block in
secondary storage.
ƒ Use overflow bucket chain when the regular
bucket is full.
Index Structures - Cont.
ƒ Hash Table Insertion and Deletion
–Insert puts a record into a normal or overflow
block.
–Delete removes a record from a normal or
overflow block. It may optionally consolidate
the blocks of a chain into one fewer block.
ƒ Efficiency of Hash Table Indexes
–Design the bucket sizes to fit into one block.
Thus, only one disk I/O is required for lookup
and two disk I/O's for insert and delete.
–Try not to search a chain of blocks in hash
table which causes many disk I/O for a single
lookup.
ƒ Extensible Hash Tables
–Instead of the array consisting of the data
blocks themselves as in static hash tables,
we keep an array of pointers to the blocks
representing the buckets.
–Each block records how many bits of the
hash value it uses
Index Structures - Cont.
–We may need to split a bucket (and double
the bucket array) and redistribute the keys when
the bits used reach their maximum (see the sketch
below).
–The main advantage of the extensible hash
table is the fact that when looking for a
record, we never need to search more than
one data block.
–It may require additional I/O when the
bucket array no longer fits in main memory.
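A compact sketch of an extensible hash table in Python (bucket blocks hold at most two records, the last i bits of hash() stand in for the real hash-value prefix, and overflow handling for identical hash values is omitted):
  BLOCK_SIZE = 2                                     # records per bucket block

  class ExtensibleHash:
      def __init__(self):
          self.i = 1                                 # bits of the hash value in use
          b0, b1 = {"bits": 1, "recs": []}, {"bits": 1, "recs": []}
          self.dir = [b0, b1]                        # 2**i pointers to buckets

      def _bucket(self, key):
          return self.dir[hash(key) % (2 ** self.i)]

      def insert(self, key):
          b = self._bucket(key)
          if len(b["recs"]) < BLOCK_SIZE:
              b["recs"].append(key)
              return
          if b["bits"] == self.i:                    # need one more bit: double the directory
              self.dir = self.dir + self.dir
              self.i += 1
          b["bits"] += 1                             # split the full bucket
          new = {"bits": b["bits"], "recs": []}
          for j in range(2 ** self.i):               # re-point half of b's directory entries
              if self.dir[j] is b and (j >> (b["bits"] - 1)) & 1:
                  self.dir[j] = new
          old, b["recs"] = b["recs"] + [key], []
          for k in old:                              # redistribute the records
              self.insert(k)

      def lookup(self, key):
          return key in self._bucket(key)["recs"]

  h = ExtensibleHash()
  for k in [4, 7, 10, 13, 20]:
      h.insert(k)
  print(h.lookup(10), h.lookup(3))                   # True False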
ƒ Linear Hash Tables
–The number of buckets n is always chosen
so the average number of records per bucket
is a fixed fraction, say 80%, of the number of
records that fill one block.
–Since blocks cannot always be split, overflow
blocks are permitted.
–If n exceeds the 2 to the i power possible bucket
numbers, then we add 1 to i, meaning one extra bit
of the hash value is used for bucket numbers.
Multidimensional and Bitmap
Indexes
Multiple Dimensional Applications
ƒ Geographic Information Systems (GIS)
–Objects are stored in a typically two-dimensional
space.
–e.g. databases of maps with houses, roads,
bridges, and other physical objects.
–Examples of the type of queries are:
Partial match queries - Specify values for 1 or
more dimensions and look for all points matching
those values in those dimensions.
Range queries - Specify ranges for 1 or more of
the dimensions and get a set of points within those
ranges.
Nearest-neighbor queries - Ask for the closest
point for a given point.
Where-am-I queries - Given a point, find which
shape(s) or region(s) contain it (e.g. locate a point
by its xy coordinates in a map)
Multidimensional and Bitmap
Indexes - Cont.
ƒ Data Cubes
–See data as existing in a (multi-) high-dimensional
space.
–Each tuple is a point in the space. Queries
are used to group and aggregate these points
for decision-support applications.
ƒ Multidimensional Queries in SQL
–A typical relational query:
select day, store, count(*) as totalSales
from sales
where item = 'shirt' and color = 'pink'
group by day, store;
ƒ Range Queries Using Conventional Indexes
–Assuming attributes x and y each have their own
B-tree index file, find all pointers for x values in the
given range, find all pointers for y values in the given
range, and intersect these pointer sets.
Multidimensional and Bitmap
Indexes - Cont.
ƒ Nearest-Neighbor Queries Using Conventional
Indexes
–Assuming the x,y coordinates example, the
closest x or the closest y alone can't
determine the actual distance to target point.
ƒ Two Categories of Multidimensional Index
Structures
–Hash-table-like approaches.
–Tree-like approaches.
Hash-Like Structures for Multi-D Data
ƒ Grid Files
–Using a 2 dimensional index example, we
can divide a 2-D grid using grid lines. Thus it
results in a set of grid partitions.
Multidimensional and Bitmap
Indexes - Cont.
ƒ Lookup in a Grid File
–Each region can be viewed as a hash table
bucket and each point in that region is placed
in a block belonging to that bucket.
–To find a point, use the values for x and y to
locate a region and search the data block
within that bucket.
ƒ Insertion Into Grid Files
–Same logic as in lookup to locate the correct
bucket.
–Add overflow blocks as needed.
–Reorganize the structure by adding or
moving the grid lines. This is similar to the
dynamic hashing technique in section 13.4.
ƒ Performance of Grid Files
–A problem arises in the high-dimensional
case, where the number of buckets grows
exponentially with the dimension.
Multidimensional and Bitmap
Indexes - Cont.
–The choice of grid lines is important so
that the data can be evenly distributed.
–Behavior on different types of queries:
Lookup of Specific Points - Just follow the path to
the proper bucket and search.
Partial-Match Queries - Use only the grid line(s)
needed to find the bucket(s) and search.
Range Queries - The ranges select a region of
buckets; search all of them, including the buckets
on the border.
Nearest-Neighbor Queries - If grid lines are not
proportional, then it may have to search buckets
that are not adjacent to the one containing the
point P.
ƒ Partitioned Hash Functions
–Use the set of attribute values to develop the
hash function.
Multidimensional and Bitmap
Indexes - Cont.
–If we take 2 hash values from x and 3 from y,
then there are 6 combinations, which can be
represented with only 3 bits of the hash key (see
the sketch below).
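A small sketch of a partitioned hash function in Python, where the bucket number concatenates bits computed from each attribute (the bit counts here are illustrative, not the 6-combination example above):
  # 1 bit of the bucket number comes from x and 2 bits from y: 2 * 4 = 8 buckets.
  def partitioned_hash(x, y, x_bits=1, y_bits=2):
      hx = hash(x) % (2 ** x_bits)
      hy = hash(y) % (2 ** y_bits)
      return (hx << y_bits) | hy                # concatenate the two partial hash values

  # Partial-match on x alone: only the buckets whose x-part matches are searched.
  def buckets_for_x(x, x_bits=1, y_bits=2):
      hx = hash(x) % (2 ** x_bits)
      return [(hx << y_bits) | hy for hy in range(2 ** y_bits)]

  print(partitioned_hash("Disney", 1995))       # one bucket number in 0..7
  print(buckets_for_x("Disney"))                # the 4 candidate buckets for x = 'Disney'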
ƒ Comparison of Grid Files and Partitioned
Hashing
–Partitioned hash tables are useless for
nearest-neighbor queries or range queries.
But a well chosen hash function will
randomize the buckets evenly. If we only
need to support partial match queries, then
partitioned hash table is likely to outperform
the grid file.
–Grid file is superior for most types of queries.
But when the dimension is very large, there
could be many buckets that are empty or
nearly empty. Thus we will need to keep a
larger in-memory structure than necessary.
Multidimensional and Bitmap
Indexes - Cont.
Tree-Like Structures for Multi-D Data
ƒ Multiple-Key Indexes
–Use the root/first level for attribute 1 and the
second level for attribute 2, etc.
–Performance
Partial-Match Queries - If the attributes in the top
levels are specified, then the access is quite
efficient. Otherwise it must search every subindex,
which can be time-consuming.
Range Queries - If the individual indexes support
range queries on their attributes, then the query
will be fairly easy.
ƒ kd-Trees (k-dimensional search tree)
–It is a binary tree in which interior nodes
have an associated attribute A and a value V
that splits the data points into two parts, less
than V and greater than or equal to V. The
levels of the trees are filled with rotating
attributes of all dimensions.
Multidimensional and Bitmap
Indexes - Cont.
–Interior nodes have an attribute, dividing
value, and left and right pointers.
–Leaf nodes are blocks with space for as
many records as a block can hold. Splitting a
block requires adding a new interior node.
–Performance
Partial-Match Queries - If a value is available for a
given attribute, then go left or right. Otherwise it
will need to examine both left and right nodes.
Range Queries - If the range straddles the splitting
value at the node, then we must explore both
children. The further it is from the leaf node, the
more possibilities it has to examine.
Nearest-Neighbor Queries - Treat the problem as
a range query and repeat with a larger range if
necessary.
–Can optimize kd-tree by using B-tree nodes
as its interior nodes. Alternatively it can pack
the interior nodes into a single block, thus it
reduces the number of disk I/O's.
Multidimensional and Bitmap
Indexes - Cont.
ƒ Quad Trees
–Assume that N records fit into a
block. Starting from the root, if a region has more
than N records, then divide it into 4 branches,
splitting according to certain values of
an attribute pair. Recurse until all the
nodes satisfy the N-record requirement.
ƒ R-Trees (region trees)
–Use data regions (leaf nodes) to represent
data block(s) and interior regions as
links to the data regions they contain.
–Keep overlap among regions to a minimum.
Bitmap Indexes
ƒ A bitmap index for a field F is a collection of
bit-vectors of length N, one for each possible
value that may appear in the field F. For
example, if there are 6 records, we use 6 bits.
Multidimensional and Bitmap
Indexes - Cont.
ƒ Motivation for Bitmap Indexes
–Questions: does it take too much space? is it
dependent on the number of records?
–A major advantage of bitmap indexes is that
they allow us to answer partial-match queries
very efficiently. Find the proper bit-vectors for
the values of attributes A and B, then take the
bitwise AND of these vectors.
–For range queries, we take the vectors in the
range of A and bitwise OR them together.
Second we take the vectors in the range of B
and bitwise OR them together. Lastly we
bitwise AND the two results and have
identified the records which satisfy the
query (see the sketch below).
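A short sketch of bit-vectors and a partial-match query in Python (integers serve as bit-vectors; the records and field positions are made up):
  records = [("Disney", 1995), ("Fox", 1995), ("Disney", 1990), ("Fox", 1990)]

  # One bit-vector per distinct value of the field; bit i is on if record i has that value.
  def bitmap_index(records, field):
      index = {}
      for i, rec in enumerate(records):
          index[rec[field]] = index.get(rec[field], 0) | (1 << i)
      return index

  studio_idx = bitmap_index(records, 0)
  year_idx = bitmap_index(records, 1)

  # Partial match: studio = 'Disney' AND year = 1995 -> AND the two bit-vectors.
  match = studio_idx["Disney"] & year_idx[1995]
  print([records[i] for i in range(len(records)) if (match >> i) & 1])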
ƒ Compressed Bitmaps
–Use an encoding algorithm to reduce the
number of bits required in storage.
Multidimensional and Bitmap
Indexes - Cont.
ƒ Managing Bitmap Indexes
–Finding Bit-Vectors
We can use B-trees or hash tables to locate the
key-pointer pairs which leads us to the bit-vector
for the key value.
–Finding Records
Once we know a record number, we can use a
secondary index on the data file, whose search key
is the record number, to retrieve the record.
–Handling Modifications to the Data File
Record numbers must remain fixed once
assigned. If a record is deleted, then it must be
replaced by a "tombstone".
Inserting a value which already has a corresponding
bit-vector just turns on the new bit. If the value is
new, then we create a bit-vector and add it to the
secondary index that is used to find the bit-vectors.
Updating a value from one bit-vector to another
will need to update both the bit-vectors.
Query Execution
Parts of the Query Processor
query
  |
  v
Query Compilation <-- metadata
  |
  v  (produces a query plan)
Query Execution
  ^
  |
  v
data
Query Execution - Cont.
ƒ Scanning Tables
–The most basic thing we can do in a physical
query plan is to read the entire contents of a
relation R.
–Two approaches to scan tuples of a relation.
Table-scan - Read the blocks containing all the
tuples of R.
Index-scan - An index on an attribute A of
R can be used to get all the tuples of R.
ƒ Sorting While Scanning Tables
–An ORDER BY clause will require the tuples
to be sorted. Additionally, various algorithms
for relational-algebra operations require one
or both arguments to be sorted relations.
–The physical-query-plan operator sort-scan
takes a relation R and a set of attributes on
which the sort is to be made, and produces R
in that sorted order.
Query Execution - Cont.
–Several ways to implement sort-scan
If there is an index to the attribute(s), then an
index-scan will produce R in the desired order.
If all the tuples can fit into main memory, then we
can do a table-scan and apply one of the efficient
main-memory sorting algorithms.
If R is too large to fit in main memory, then we can
apply the TPMMS algorithm and deliver the sorted
results to memory rather than to disk.
ƒ Model of Computation for Physical Operators
–To estimate the cost for each operator, we
shall use the number of disk I/O's as the
measuring factor for an operation.
–When comparing algorithms for the same
operations, we assume that the arguments of
any operator are found on disk, but the result
of the operator is left in main memory.
–In most cases, the result from an operator is
not written to disk but it is passed in memory
from one operator to another.
Query Execution - Cont.
ƒ Parameters for Measuring Costs
–The parameters are usually called statistics.
Incorrect statistics can fool the query
optimizer to choose the wrong plan, thus it
may impact the performance of the query.
–There are 3 parameter families: B,T, and V
B(R) denotes the number of blocks that are
needed to hold all the tuples of R. We usually
assume that R is clustered (nearby blocks).
T(R) denotes the number of tuples in R.
Then T(R) / B(R) is the number of tuples per block.
V(R,a) denotes the number of distinct values that
appear in column a of relation R.
ƒ Iterator
–It is a group of three functions that allows a
consumer of the result of the physical
operator to get the result one tuple at a time.
–The 3 functions are Open, GetNext, and
Close.
Query Execution - Cont.
Open - Initialize any data structures needed to
perform the operation and calls Open for any
arguments of the operation. It does not actually
get any tuples.
GetNext - Returns the next tuple in the result. If
there are no more tuples, then return NotFound.
Close - Ends the iteration after all required tuples
have been consumed or obtained.
–Iterators are used to implement many of the
physical operators, such as a table-scan (see
the sketch below).
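A minimal sketch of the Open/GetNext/Close interface for a table-scan, in Python (an in-memory list stands in for the blocks of R, and None stands in for NotFound):
  class TableScan:
      def __init__(self, tuples):
          self.tuples = tuples

      def open(self):                 # initialize state; no tuples are produced yet
          self.pos = 0

      def get_next(self):             # return the next tuple, or None for NotFound
          if self.pos >= len(self.tuples):
              return None
          t = self.tuples[self.pos]
          self.pos += 1
          return t

      def close(self):                # release resources held by the scan
          self.tuples = None

  scan = TableScan([(1, "a"), (2, "b")])
  scan.open()
  while (t := scan.get_next()) is not None:
      print(t)
  scan.close()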
One-Pass Algorithms for DB Operations
ƒ Three classes of algorithms for operators,
sorting-based, hash-based, and index-based.
ƒ There are one, two, and three+ pass
algorithms depending on the data size. An
example of two-pass algorithm is TPMMS.
ƒ There are three classifications of operators.
Query Execution - Cont.
–Tuple-at-a-time, unary operations - selection
and projection do not require an entire
relation, we can read a block at a time and
produce our output.
–Full-relation, unary operations - these one
argument operations require seeing all or
most of the tuples in memory at once. An
example is the grouping operator.
–Full-relation, binary operations - All other
operations such as union, intersection,
difference, joins, and products.
ƒ Tuple-at-a-time Operations
–This can be a table-scan or index-scan. If
the selection is done based on an attribute in
an index, then only a subset of the tuples will be
returned.
–Can improve performance by disk clustering.
Query Execution - Cont.
ƒ Unary, Full-Relation Operations
–Duplicate Elimination
Read one entry at a time. If it is a duplicate, then
skip; otherwise add it in memory.
If the number of tuples is large, then memory
may not be able to store all the unique entries.
Must use an efficient data structure such as a BST
or hash table for detecting duplicates.
–Grouping
For MIN(a) or MAX(a), just save the minimum or
maximum value while looping thru all the tuples.
For COUNT, add one for each tuple of the group
that is seen.
For SUM(a), add the value of attribute a to the
total for its group.
For AVG(a), do both COUNT and SUM(a) at the
same time. Then take the quotient of the sum and
the count to get the average (see the sketch below).
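A short sketch of one-pass grouping in Python: one small accumulator per group key keeps COUNT, SUM, MIN, and MAX, and AVG is derived at the end (the tuples are made up):
  tuples = [("Disney", 100), ("Fox", 90), ("Disney", 120), ("Fox", 130)]

  groups = {}                                   # group key -> [count, sum, min, max]
  for key, a in tuples:
      acc = groups.setdefault(key, [0, 0, a, a])
      acc[0] += 1
      acc[1] += a
      acc[2] = min(acc[2], a)
      acc[3] = max(acc[3], a)

  for key, (cnt, total, lo, hi) in groups.items():
      print(key, cnt, total, lo, hi, total / cnt)   # COUNT, SUM, MIN, MAX, AVG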
Query Execution - Cont.
ƒ One-Pass Algorithms for Binary Operations
–Use a B or S next to the operators to denote
whether it is a bag or set version of the
operators.
–Bag Union
Read all B(R) + B(S) and output all the results.
–Set Union
Read table S into M-1 buffers in main memory,
build a search structure, and copy S to the output.
Read each block of table R into the Mth buffer one
at a time. For each tuple t of R, if t is in S, skip it,
else copy it to the output.
–Set Intersection
Same logic as set union except that if each tuple t
of R is found in S, copy, else skip.
–Set Difference
Assuming R is the larger relation, read S into
memory.
R-S: for each t in R, if t is in S, ignore it, otherwise copy it.
S-R: for each t in R, if t is in S, delete t from S in
memory, else do nothing; at the end, output the
tuples remaining in S (see the sketch below).
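A small sketch of the one-pass set operations above, in Python: S is held in a main-memory set while R is streamed tuple by tuple (the data is made up):
  S = {(1,), (2,), (3,)}
  R = [(2,), (4,), (2,)]                    # R is streamed; S fits in the M-1 buffers

  def set_union(R, S):
      out, seen = list(S), set(S)           # output all of S, then the new tuples of R
      for t in R:
          if t not in seen:
              out.append(t)
              seen.add(t)
      return out

  def r_minus_s(R, S):
      return [t for t in dict.fromkeys(R) if t not in S]

  def s_minus_r(R, S):
      remaining = set(S)
      for t in R:
          remaining.discard(t)              # delete matching tuples of S in memory
      return list(remaining)                # what is left is S - R

  print(set_union(R, S), r_minus_s(R, S), s_minus_r(R, S))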
Query Execution - Cont.
–Bag Intersection
Read table S into M-1 buffers in main memory and
build a search structure. If there is any duplicates
from S, we keep only one copy but with a counter.
Read each block of table R into the Mth buffer one
at a time. For each tuple t of R, if t is in S and its
counter is not 0, copy t to the output and decrement
the counter; otherwise skip t.
–Bag Difference
S-R: Read distinct tuples of S into memory with a
count. For each t in R, if t is in S, decrement
count. At the end, copy the tuples with a count of
greater than 0.
R-S: Read distinct tuples of S into memory with a
count. For each t in R, if t is not in S, copy to
output. Otherwise if count=0, copy t to output; if
count>0, do not copy t but decrement count.
–Product
Read S into M-1 buffers in memory. Read each
block of R into the Mth buffer. For each t in R,
concatenate t with each tuple of S to form the
output set.
Query Execution - Cont.
–Natural Join
If we have R(x,y) and S(y,z) with y being common
in both relations, then to compute the natural join,
do the following.
Read all of S into M-1 buffers of memory using
y as the search key, probably with a balanced tree
or hash table.
Read each block of R into the remaining buffer.
For each tuple t of R, find the tuples of S that
agree with t on all attributes of y using the
search structure. For each matching tuple of S,
output the join result of this combination.
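A sketch of this one-pass natural join in Python, with a dictionary standing in for the in-memory search structure on y; relation layouts and names are illustrative only.

# One-pass natural join of R(x, y) and S(y, z): S is read fully into a
# hash table keyed on y; R is then streamed and probed tuple by tuple.
from collections import defaultdict

def one_pass_join(R, S):
    index = defaultdict(list)          # the in-memory search structure on y
    for y, z in S:
        index[y].append(z)
    out = []
    for x, y in R:                     # stream R; probe the structure on y
        for z in index.get(y, []):
            out.append((x, y, z))      # output the joined tuple
    return out

print(one_pass_join(R=[(1, "a"), (2, "b")], S=[("a", 10), ("a", 20)]))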
Nested-Loop Joins
ƒ Tuple-Based Nested-Loop Join
–If we have R(x,y) and S(y,z), in this algorithm
we compute the join as follows:
FOR each tuple s in S DO
FOR each tuple r in R DO
IF r and s join to make a tuple t THEN
output t;
Query Execution - Cont.
ƒ Block-Based Nested-Loop Join Algorithm
–Improvement over the tuple-based method
Organizing access to both argument relations by
blocks.
Using M-1 blocks to store tuples belonging to the
relation that is part of the outer loop.
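A sketch of the block-based nested-loop join, where read_blocks is a hypothetical stand-in for the buffer manager and M is the assumed number of available buffers.

# Block-based nested-loop join sketch. read_blocks() yields the relation in
# chunks of `blocks` buffers, each holding tuples_per_block tuples.
def read_blocks(relation, blocks, tuples_per_block=2):
    chunk = blocks * tuples_per_block
    for i in range(0, len(relation), chunk):
        yield relation[i:i + chunk]

def block_nested_loop_join(R, S, M=4):
    out = []
    for s_chunk in read_blocks(S, M - 1):      # outer: M-1 buffers of S at a time
        s_index = {}                           # search structure on the join key y
        for y, z in s_chunk:
            s_index.setdefault(y, []).append(z)
        for r_chunk in read_blocks(R, 1):      # inner: R one block at a time
            for x, y in r_chunk:
                for z in s_index.get(y, []):
                    out.append((x, y, z))
    return out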
Two-Pass Algorithms Based on Sorting
ƒ Same idea as TPMMS
–Read M blocks of R into main memory.
–Sort these M blocks using an efficient main-memory sorting algorithm.
–Write the sorted list into M blocks of disk.
–Use a second pass to merge the sorted
sublists to execute the desired operator.
ƒ Duplicate Elimination
–In second pass, just copy the output without
the duplicates.
Query Execution - Cont.
ƒ Grouping and Aggregation
–In the first pass, sort each group of M blocks
using the grouping attributes of L as the sort key.
–In the second pass, accumulate the needed
aggregates for each sort-key value v.
–When all the tuples with sort-key value v are done,
output a tuple consisting of the result for this
group.
ƒ Union
–Bag unions do not require two passes.
–Set union works very similarly to duplicate
elimination except for the following:
In pass one, sort both relations, R and S.
In pass two, make sure all the sublists of both R
and S are read into M and merged without
duplicates until either R or S is exhausted, then finish
merging the remaining relation.
Query Execution - Cont.
ƒ Intersection and Difference
–Pass one remains the same for all cases.
The output conditions are determined in pass
two.
Set intersection - output t if it appears in both R
and S.
Bag intersection - output t the minimum of the
number of times it appears in R and S.
Set difference - output t if and only if it appears in
R but not in S.
Bag difference - output t the number of times it
appears in R minus the number of times it appears
in S.
ƒ Simple Sort-Based Join Algorithm
–Given R(X,Y) and S(Y,Z), sort R and S
independently using TPMMS with sort key Y.
–Merge the sorted R and S; output joined tuples
from R and S only when they agree on the sort
key value of Y. We need two large buffers for
this merge.
Query Execution - Cont.
ƒ More Efficient Sort-Based Join
–Can improve the simple join by sorting into
sublists for both R and S.
–Bring the first block of each sublist into a
buffer assuming there are no more than M
sublists in all.
–Repeat the same algorithm with M buffers
rather than just two.
Two-Pass Algorithms Based on Hashing
ƒ Partitioning Relations by Hashing
–Initialize M-1 buckets using M-1 empty
buffers. Read each block of the relation into the
Mth buffer, one at a time. For each tuple t in that
buffer, copy t to one of the hash buckets based
on h(t).
–When a bucket's buffer is full, write it to disk and
continue until all the tuples have been read.
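A sketch of this partitioning pass in Python; buckets are plain in-memory lists here, whereas a real system would flush a bucket's buffer to disk whenever it fills.

# Partition a relation into M-1 hash buckets: the first pass shared by the
# hash-based two-pass algorithms.
def partition(relation, M, h=hash):
    buckets = [[] for _ in range(M - 1)]
    for t in relation:
        buckets[h(t) % (M - 1)].append(t)   # copy t to the bucket chosen by h(t)
    return buckets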
Query Execution - Cont.
ƒ Duplicate Elimination
–Since tuples with the same key value will be
hashed into the same bucket, we can process one
bucket at a time and output the results.
–If M is not large enough to hold the distinct
tuples of a bucket, then one extra pass may be needed.
ƒ Grouping and Aggregation
–Change the hash function to depend on the
grouping attributes of the list L.
–On the second pass, we only read one
record per group as we process each bucket.
ƒ Union, Intersection, Difference
–We must use the same hash function for
both R and S.
–Apply the one-pass algorithm depending on
the operator.
ƒ Hash-Join Algorithm
–Use the join attribute(s) in hash function.
–Join R and S with corresponding buckets.
Query Execution - Cont.
Index-Based Algorithms
ƒ Clustering and Nonclustering Indexes
–A relation is "clustered" means the tuples in a
relation are packed together, minimize the
number of blocks used.
–Clustering indexes mean the search key is
packed into the least amount of blocks.
ƒ Index-Based Selection
–An index-scan can significantly reduce the
number of tuples. Thus it can remove the
need for the two-pass algorithms.
ƒ Joining by Using an Index
–The bitmap index structure is an example of
unordered index join.
ƒ Joins Using a Sorted Index
–A zig-zag join is possible when we have sorted
indexes on Y for both R and S; then not all the
tuples need to be retrieved during the
intermediate steps.
Query Execution - Cont.
Buffer Management
ƒ Accepts requests for main-memory access to
disk blocks. Obviously not all of main memory (M)
is available during query processing.
ƒ BM Architecture
–Two approaches
Controls main memory directly, common in many
relational DBMS's, e.g. it reserves a fixed-size
buffer pool based on a configuration variable at
startup time.
Uses virtual memory, letting the OS decide which
buffers are actually in main memory and which are in
swap space on disk.
ƒ BM Strategies
–The buffer-replacement strategies are similar
to those used by an OS for page replacement.
–Least-Recently Used (LRU)
–First-In-First-Out (FIFO)
Query Execution - Cont.
–The "Clock" Algorithm
Line up all the buffers in a circle. When we read
into a buffer, set the flag to 1. Default is 0.
Look for the next 0 block, if you pass a block with
flag=1, then set it to 0 and continue the search.
–Pinned Blocks
Change the LRU, FIFO, Clock policy to allow a
block to be pinned (possibly with timeout).
In the Clock algorithm, we can use the flag as a reference
count and decrement it when the block is not being used.
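A sketch of the "Clock" replacement policy described above, with the class and method names chosen for illustration only.

# Buffers sit in a circle with a use flag; the hand clears 1-flags until it
# finds a buffer whose flag is already 0, which becomes the victim.
class ClockBuffer:
    def __init__(self, n):
        self.flags = [0] * n     # one use flag per buffer, default 0
        self.hand = 0

    def touch(self, i):
        self.flags[i] = 1        # reading into buffer i sets its flag

    def evict(self):
        while True:
            if self.flags[self.hand] == 0:
                victim = self.hand                 # found an unreferenced buffer
                self.hand = (self.hand + 1) % len(self.flags)
                return victim
            self.flags[self.hand] = 0              # give it a second chance
            self.hand = (self.hand + 1) % len(self.flags)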
ƒ Physical Operator Selection and BM
–The query optimizer will select a set of
physical operators that will be used to execute
a given query.
–The selection may assume that a certain
number of buffers M is available at that time,
but that number is not guaranteed by the BM.
Query Execution - Cont.
–Impact of algorithms when M shrinks
In sort-based algorithms for certain operators, we
can still function by reducing the size of the sublists.
If there are too many sublists, it can cause a
merging problem.
Main-memory sorting of lists is not affected.
In hash-based algorithms, we can reduce the
number of buckets, but it becomes a problem if
a bucket grows so large that it doesn't fit in
main memory. Once the first pass has started, the
number of buckets remains fixed throughout it.
Algorithms Using More Than Two Passes
ƒ Multipass Sort & Hash-based Algorithms
–A natural extension of the two-pass
algorithms. Instead of 2 passes, you have k
passes over the data with M buffers.
–Performance of multipass algorithms still
depends on the number of disk/buffer I/Os.
Query Execution - Cont.
Parallel Algorithms for Relational Operations
ƒ Models of Parallelism
–With p processors, M memory buffers, and d
disks, here are the three most important
classes of parallel machines.
Shared Memory - If each processor has its own
cache, then it's important for a processor to
ensure the data required is indeed in its cache,
otherwise it must fetch from shared memory.
Shared Disk - Each processor has its own private
memory which is not shared among other
processors. All the disks are shared and therefore
the burden is put on the disk controllers for
competing requests.
Shared Nothing - Each processor has its own
memory and disk(s). It is very efficient unless one
processor has to communicate with another;
communication is usually handled with message queues.
Query Execution - Cont.
ƒ Tuple-at-a-Time Operations in Parallel
–If the relation R is evenly divided among d
disks with p processors, and d and
p are approximately equal, this can achieve
nearly maximum speedup for the projection
operator. On the other hand, a selection
operator may or may not benefit from this
setup.
ƒ Parallel Algorithms for Full-Relation Operations
–We can further adapt the algorithms so that the
independent portions of any operator's work
are processed in parallel.
ƒ Performance of Parallel Algorithms
–The idea here is to reduce the elapsed time to
1/p of the serial time, where p is the number of processors.
–Hashing is well suited to distributing the
relation R across multiple disks using p
processors.
The Query Compiler
Parsing
ƒ Syntax Analysis and Parse Trees
–A parser translates language text such as
SQL into a parse tree.
Atoms - basic lexical elements such as symbols,
keywords, identifiers, etc. They have no children.
Syntactic categories - represented by angle
brackets around a descriptive name. Each is defined
by rules which are composed of other syntactic
categories or atoms.
ƒ Grammar for a Simple Subset of SQL
–<Query> ::= <SFW>
–<Query> ::= ( <Query> )
–<SFW> ::= select <SelList>
from <FromList>
where <Condition>
–<SelList> ::= <Attribute>, <SelList>
–<SelList> ::= <Attribute>
The Query Compiler - Cont.
–<FromList> ::= <Relation> , <FromList>
–<FromList> ::= <Relation>
–<Condition> ::= <Condition> and <Condition>
–<Condition> ::= <Tuple> in <Query>
–<Condition> ::= <Attribute> = <Attribute>
–<Condition> ::= <Attribute> like <Pattern>
–Example:
–StarsIn(movieTitle, movieYear, starName)
–MovieStar(name, address, gender, birthdate)
–From the following query:
SELECT movieTitle
FROM StarsIn
WHERE starName IN
(
SELECT name
FROM MovieStar
WHERE birthdate LIKE '%1960'
);
The Query Compiler - Cont.
ƒ The Preprocessor
–If a view is used instead of a table in the
from-list, the preprocessor substitutes the
view's definition for it.
–Responsible for semantic checking.
Check for valid relation uses
Check and resolve attribute uses
Check data types
–Once the parse tree passes all the tests, it is
given to the logical query-plan generator.
Algebraic Laws for Improving Query Plans
ƒ Commutative and Associative Laws
–Commutative - order not important
e.g. x + y = y + x OR x * y = y * x
–Associative - grouping
e.g. (x + y) + z = x + (y + z) OR
(x * y) * z = x * (y * z)
The Query Compiler - Cont.
ƒ Laws Involving Selection
–Splitting laws in multi-condition cases
SEL c1 AND c2 (R) = SEL c1(SEL c2(R))
SEL c1 OR c2 (R) = (SEL c1(R)) u_S (SEL c2(R)), where u_S is set union
–Laws that allow us to push selections through
the binary operators: product, union,
intersection, difference, and join.
For a union, the selection must be pushed to both
arguments.
For a difference, the selection must be pushed to
the first argument and optionally the second.
For other operators, it is only required that the
selection be pushed to one argument.
Union law:
SEL c(R u S) = SEL c(R) u SEL c(S)
Difference law:
SEL c(R - S) = SEL c(R) - S
or
SEL c(R - S) = SEL c(R) - SEL c(S)
The Query Compiler - Cont.
For product, intersection, and joins, if R has all the
attributes mentioned in c, then we can write:
SEL c(R x S) = SEL c(R) x S
or if c has only attributes of S then:
SEL c(R x S) = R x SEL c(S)
or if R and S both happen to have all attributes of c,
then we can use the following law:
SEL c(R n S) = SEL c(R) n SEL c(S)
ƒ Pushing Selections
–One main strategy is to push the selection
operators down to the lowest level
possible.
–This strategy reduces the size of the data
set being passed upward in the expression tree.
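As a small worked instance (the condition a=10 is illustrative, not from the text), the product law lets the selection be applied before the much larger product is formed:

\sigma_{a=10}(R \times S) = \sigma_{a=10}(R) \times S

which in the notation above reads SEL a=10(R x S) = SEL a=10(R) x S.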
ƒ Laws Involving Projection
–We may introduce a projection anywhere in
an expression tree, as long as it eliminates
only attributes that are never used by any of
the operators above and are not in the result
set of the entire expression.
Coping With System Failures
Issues and Models for Resilient Operation
ƒ Failure Modes
–There are many different things that can go
wrong while a database is queried and
modified. Here are several types of failures.
Erroneous Data Entry
These errors can be dealt with using SQL
features such as constraints and triggers.
Media Failures
Use one of the RAID schemes.
Maintain an archive/backup of the database and
its logs for recovery.
Use distributed redundant copies of the
database running simultaneously.
Catastrophic Failure
Events such as explosions, fires, or vandalism at
the location of the database.
The archive and redundant-copies methods can be
used for recovery.
Coping With System Failures Cont.
System Failure
The processes that query and modify the
database are called transactions. There are a
few cases where the transactions are not
completed and their transaction states are lost.
Such events include power failure, operating
system crashes, and software crashes.
Recovery from this type of failure requires that all
database transactions be logged. When
the DBMS is restarted, these logs are used to
"undo" the uncommitted transactions.
ƒ Transactions (Ts)
–Relationship between the log manager and
the transaction manager:
QP --- TM --- LM
  \     |     /
      BM ----- RM
       |
Disk (Data & Log)
Coping With System Failures Cont.
ƒ Correct Execution of Transactions
–Basic elements of a T include relations, disk
blocks or pages, and individual tuples.
–The database can be in a consistent or
inconsistent state.
The Correctness Principle states that if a T
executes with no other Ts and no system
errors, and it starts in a consistent state, then
when it ends, the DB is again in a consistent state.
–A T is atomic, all or nothing. If only part of a
T is executed, then the DB is likely to be
inconsistent.
–Simultaneous Ts on the same database
elements can result in an inconsistent DB
unless the proper control or locking
techniques are used.
ƒ Primitive Operations of Transactions
–There are three layers of address space.
Disk blocks <-> M (buffers) <-> local transaction address space
Coping With System Failures Cont.
Here are some sample primitives of a transaction,
given the assumption that X fits into a single disk
block.
INPUT(X) - copy the disk block containing DB
element X to a memory buffer
READ(X,t) - copy the DB element X to the
transaction's local variable t. If X is not in
memory, then do INPUT(X) first.
WRITE(X,t) - copy the value of local variable t to
the DB element X in a memory buffer. If X is not
in memory, then do INPUT(X) first.
OUTPUT(X) - copy the block containing X from
its buffer to disk.
For example, suppose a T performs READ(X,t),
WRITE(X,t), OUTPUT(X) in this sequence and the
system crashes before the OUTPUT(X) is
completed. Then the DB may be left in an
inconsistent state, and proper log files are required
to undo or redo the operation(s).
Coping With System Failures Cont.
Undo Logging
ƒ A log is a sequence of log records. Each T can
have multiple log entries. In most cases there
are many concurrent Ts, so the records of a
single T are not necessarily consecutive in the log.
ƒ Log Records
–There are several forms of log record
discussed in this chapter:
<START TID> - this record indicates that T TID
has begun.
<COMMIT TID> - this indicates that T TID has
completed successfully. Changes to the DB
should appear on disk after commit but the BM will
decide when exactly to write/flush M to disk.
<ABORT TID> - T TID failed. The LM must ensure
that no changes made by TID appear on disk (or
that their effect is cancelled).
<TID,X,v> - an update record, where v is the old
value of DB element X. This log record corresponds
to a WRITE action, not an OUTPUT action.
Coping With System Failures Cont.
ƒ Undo-Logging Rules
–U1: If T modifies DB element X, then the log
record of the form <TID,X,v> must be written
to disk before the new value of X is written to disk.
–U2: If a T commits, then its COMMIT log
record must be written to disk only after all the
DB elements changed by the transaction have
been written to disk.
–The sequence of steps for a T are listed in
the following order.
a) log record(s) indicating changed DB elements
b) actual changed DB elements
c) COMMIT log record
–The LM will need a flush-log command to
inform the BM to write the logs to disk.
–If there are two Ts updating the same
element, a locking scheme must be in place
to prevent the element from being written to disk
prematurely.
Coping With System Failures Cont.
ƒ Recovery Using Undo Logging
–The first task of the RM is to divide the Ts into
committed and uncommitted ones. e.g. If
there is a <START TID> entry, the RM expects
to find a matching <COMMIT TID> entry;
otherwise the T is considered uncommitted.
–Once we know all the uncommitted TIDs, we
go through the log entries in reverse order
(usually ordered by time). There are a few
different scenarios.
If the <COMMIT TID> record got to disk before the
crash, everything is OK.
If the <COMMIT TID> is not on disk but the
changes are, then we need to undo the changes
and write an <ABORT TID> record to the log.
If the changed elements did not make it to disk,
then we just need to abort the T without the undo
step.
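A sketch of this recovery procedure in Python, with the log reduced to a list of in-memory tuples and the DB to a dictionary; the record layout is an illustrative simplification of the log format above.

# Undo-log recovery: scan the log backward and restore the OLD value v for
# every update made by a transaction that has no COMMIT record.
# Records: ("START", tid), ("COMMIT", tid), ("ABORT", tid),
# or ("UPDATE", tid, x, v) where v is the old value of element x.
def undo_recover(log, db):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    for rec in reversed(log):                    # newest record first
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, tid, x, old_value = rec
            db[x] = old_value                    # undo the uncommitted change
    return db

db = undo_recover(
    [("START", "T1"), ("UPDATE", "T1", "X", 5), ("START", "T2"),
     ("UPDATE", "T2", "Y", 8), ("COMMIT", "T1")],
    {"X": 10, "Y": 9})
print(db)   # {'X': 10, 'Y': 8}: T2's change to Y is undone; T1 committed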
Coping With System Failures Cont.
ƒ Checkpointing
–We use checkpoints to eliminate the need to
scan the entire log file.
–One technique is to stop accepting new Ts,
wait until all the active Ts have a COMMIT or
ABORT record on the log, flush the log to
disk, write a log record <CKPT>, flush the
log again, and finally resume accepting Ts.
ƒ Nonquiescent Checkpointing
–The normal checkpoint method requires all
incoming Ts to be paused. In most cases, this
behavior is unacceptable in a production
environment.
–Allow incoming Ts while we write a log record
<START CKPT (T1, T2, ...Tk)> and flush the
log. T1, ..., Tk are the names for all the active
transactions. When all the T1, ... Tk have
completed or aborted, write a log record
<END CKPT> and flush the log.
Coping With System Failures Cont.
Redo Logging
ƒ Differences between Undo and Redo Logging
–Undo logging cancels the effect of
incomplete Ts and ignores committed ones
during recovery; redo logging ignores
incomplete Ts and repeats the changes made
by committed Ts.
–Undo logging requires us to write changed
DB elements to disk before the COMMIT log
record reaches disk; redo logging requires
that the COMMIT record appear on disk
before any changed values reach disk.
–Undo logging needs the old values of
changed DB elements to recover using the U1
and U2 rules. Redo logging needs the new
values instead.
Coping With System Failures Cont.
ƒ Redo-Logging Rule
–R1: Before modifying any DB element X on
disk, it is necessary that all log records
pertaining to this modification of X, including
both the update record <TID,X,v> and the
<COMMIT TID> record, must appear on disk.
–The sequence of steps for a T are listed in
the following order in the redo context.
a) log record(s) indicating changed DB elements
b) COMMIT log record
c) actual changed DB elements
ƒ Recovery With Redo Logging
–1) Identify the committed Ts.
–2) Scan the log forward from the beginning.
For each log record <TID,X,v> encountered:
a) If TID is not a committed T, do nothing.
b) If TID is committed, write value v for DB
element X.
Coping With System Failures Cont.
–3) For each incomplete transaction TID, write
an <ABORT TID> record to the log and flush
the log.
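A matching sketch for redo-log recovery, under the same simplifying assumptions as the undo sketch above, except that here the update record carries the NEW value.

# Redo-log recovery: scan the log FORWARD and rewrite the new value v for
# every update whose transaction committed; uncommitted Ts are ignored
# (and would receive an <ABORT> record).
def redo_recover(log, db):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    for rec in log:                              # oldest record first
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, tid, x, new_value = rec
            db[x] = new_value                    # redo the committed change
    return db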
ƒ Checkpointing a Redo Log
–1) Write a log record <START CKPT
(T1,...,Tk)>, where T1,...,Tk are all the active
(uncommitted) Ts, and flush the log.
–2) Write to disk all DB elements that were
written to buffers but not yet to disk by Ts that
had already committed when the START
CKPT record was written to the log.
–3) Write an <END CKPT> record to the log
and flush the log.
ƒ Recovery Using a Checkpointed Redo Log
–If the last checkpoint record is an END CKPT:
We know that all the Ts that committed before the
matching START CKPT have their changes written to
disk.
Coping With System Failures Cont.
Any T(i) in the T1...Tk list or that started after the
START CKPT can still have changes it made not
yet migrated to disk, even though the T has
committed. In this case, we follow the recovery
steps described earlier.
–If the last CKPT record is a START CKPT:
We cannot be sure that Ts committed prior to the
start of this checkpoint had their changes written
to disk. Therefore we find the previous END CKPT
record and repeat the previous step from its
matching START CKPT.
Undo/Redo Logging
ƒ Undo/Redo Rules
–It has the same types of log records as the
other kinds of logging, except for the update record,
which has the format <TID,X,v,w>. Value v
is the original value and w is the new value.
–UR1: <TID,X,v,w> record must appear on
disk before DB element X is modified on disk.
Coping With System Failures Cont.
ƒ Recovery With U/R Logging
–Redo all the committed Ts in the order
earliest-first and undo all the incomplete Ts in
the order latest-first.
–These two steps allow flexibility in the order
in which the COMMIT log records are written.
ƒ Checkpointing an U/R Log
–1) Write a <START CKPT(T1,...Tk)> record
to the log, where T1...Tk are all active Ts, and
flush the log.
–2) Write to disk all the buffers that are dirty.
Unlike redo logging, we flush all buffers, not
just those written by committed Ts.
–3) Write an <END CKPT> record to the log,
and flush the log.
–The only requirement is that a T must not
write any values until it is certain not to abort.
Coping With System Failures Cont.
Protecting Against Media Failures
ƒ The Archive
–Full dump/backup means the entire DB is
copied.
–Incremental dump/backup means only those
DB elements changed since the previous full
or incremental backup are copied.
ƒ Nonquiescent Archiving
–Since a static backup requires the DB to be
stopped for a period of hours if the DB is
large, we need to be able to take an
"online" backup while it is still processing Ts.
–As long as the log for the duration of the
backup is preserved, then the discrepancies
(between the time it started the backup and
the completion time) can be corrected from
the log.
Coping With System Failures Cont.
–Steps for making an archive (backup image)
Write a log record <START DUMP>.
Perform a checkpoint appropriate for whichever
logging method is being used.
Perform a full or incremental dump of the data
disk(s), making sure that the copy of the data has
reached a secure place, e.g. remote site.
Make sure that enough of the log has been copied
to the secure place that at least the prefix of the
log up to and including the checkpoint in step 2 will
survive a media failure of the DB.
Write a log record <END DUMP>.
–At the completion of the backup, it is safe to
throw away the log prior to the beginning of the
checkpoint performed during the dump.
Note that the DB administrator may want to
keep multiple backups, so old backup
images and logs may be saved for a while.
Coping With System Failures Cont.
ƒ Recovery Using an Archive and Log
–1) Restore the database from archive
Find the most recent full backup and reconstruct
the DB from it.
If there are any later incremental backups, apply
them with the earliest first.
–2) Modify the database using the log files.
Use the method of recovery appropriate to the
log method being used.
Concurrency Control
Serial and Serializable Schedules
ƒ Schedules
–A schedule is a time-ordered sequence of
the important actions taken by one or more
Ts.
–Actions are viewed as the READ and
WRITE operations to/from the BM, not the
INPUT and OUTPUT actions to/from disk.
ƒ Serial Schedules
–A serial schedule is one in which all the actions
of one T are grouped together, followed by all the
actions of another T, and so on.
ƒ Serializable Schedules
–A schedule is serializable if its effect on the
DB state is the same as that of some serial
schedule, regardless of what the initial state
of the DB is.
Concurrency Control - Cont.
ƒ Effect of Transaction Semantics
–Since the scheduler does not examine the
semantics of a T, it might not know
whether a given T will modify certain DB
elements.
–The scheduler can be smarter by knowing
only whether a T intends to modify any DB
elements.
ƒ Notation for Transactions and Schedules
–1) An action is an expression of the form
r1(X) or w1(X), meaning that T1 reads or
writes, respectively, the DB element X.
–2) Ti is a sequence of actions with subscript
i.
–3) A schedule (denoted S) of a set of Ts
(denoted TT) is a sequence of actions in
which, for each Ti in TT, the actions of Ti
appear in S in the same order that they
appear in the definition of Ti itself. S is an
interleaving of the actions of all the Ts.
Concurrency Control - Cont.
Conflict-Serializability (C-S)
ƒ Conflicts
–A pair of consecutive actions in a schedule S
conflict if swapping their order can change the
behavior of at least one of the Ts
involved.
–If we have Ti and Tj, then
1) ri(X);rj(Y) is never a conflict, even if X=Y.
2) ri(X);wj(Y) is not a conflict if X!=Y
3) wi(X);rj(Y) is not a conflict if X!=Y
4) wi(X);wj(Y) is not a conflict if X!=Y
–Situations where we may not swap the order
of actions
Two actions of the same T, e.g. ri(X);wi(Y).
Two writes of the same X by different Ts, e.g.
wi(X);wj(X).
Read and write of the same X by different Ts, e.g.
ri(X);wj(X) or wi(X);rj(X).
Concurrency Control - Cont.
–Two actions of different Ts may be swapped
unless they involve the same DB element,
and at least one is a write.
–The scheduler can make as many swaps as
possible with the goal of making the schedule
into a serial schedule.
ƒ Precedence Graphs and a Test for C-S
–A graphical test for the C-S condition: a
schedule is conflict-serializable if and only if
its precedence graph is acyclic.
Enforcing Serializability by Locks
ƒ Locks
–A simple locking scheme is to request a lock
on a DB element before performing any
operation on it.
–A lock table can be used to manage locks for
all the elements within a DB.
Concurrency Control - Cont.
–Consistency of Ts
A T can only read or write an element if it
previously requested a lock on that element and
hasn't yet released the lock.
If a T locks an element, it must later unlock that
element.
–Legality of Schedules
Two Ts cannot lock the same element without one
having first released the lock.
–Notation for locks
li(X): T Ti requests a lock on DB element X.
ui(X): T Ti releases/unlocks its lock on DB element
X.
ƒ Locking Scheduler
–It utilizes a lock table to manage locks. In
this section, this can be viewed as a relation
of Locks(element, TID), consisting of pairs
(X,TID) such that TID has a lock on DB
element X. Simple SQL insert and delete will
be sufficient to manage this table.
Concurrency Control - Cont.
ƒ Two-Phase Locking (2PL)
–In every T, all lock requests precede all
unlock requests. The first phase obtains the
locks and the second phase releases them.
–It guarantees that a legal schedule of
consistent Ts is C-S.
–Does not solve possible deadlock conditions.
Different Lock Modes
ƒ Shared and Exclusive Locks
–Shared lock are known as "read lock" since
two read actions on the same DB element do
not create a conflict.
–Exclusive lock are known as "write lock"
since any writes to the DB element will affect
the behavior of other reads and writes
requests.
Concurrency Control - Cont.
–Three kinds of requirements
Consistency of Ts:
A read action r(X) must be preceded by sl(X) or
xl(X), with no intervening u(X).
A write action w(X) must be preceded by xl(X),
with no intervening u(X).
Two-phase locking of Ts:
Locking must precede unlocking. No action
sl(X) or xl(X) can be preceded by an action u(Y)
for any Y.
Legality of schedules:
An element may either be locked exclusively by
one T or by several Ts in shared mode, but not
both.
If xli(X) appears in a schedule, then there
cannot be a following xlj(X) or slj(X), for some j
other than i, without an intervening ui(X).
If sli(X) appears in a schedule, then there
cannot be a following xlj(X), for j!=i, without an
intervening ui(X).
Concurrency Control - Cont.
ƒ Upgrading Locks
–Any T can request an S lock first and then
request an X lock later when it is ready to write.
This allows more concurrency and less blocking
time, thus improving performance.
–It can lead to a deadlock
ƒ Update Locks
–An update lock uli(X) gives T only the
privilege to read X, not to write X. Once a U
lock is held on X, no other T can be granted
any lock on X. Only an update lock can be
upgraded to an X lock.
–A U lock can help avoid the deadlock
scenario caused by upgrading.
ƒ Increment Locks
–Action INC(A,c) can be defined as follows:
READ(A,t); t:=t+c; WRITE(A,t);
–Use the ili(X) notation for an increment lock.
Concurrency Control - Cont.
–A consistent T can only have an increment
action on X if it holds an I lock on X at the
time. An I lock does not enable either read or
write actions.
–In a legal schedule, any number of Ts can
hold an I lock on X at any time. If an I lock on
X is held by a T, then no other T can hold
either an S or an X lock on X at the same time.
An Architecture for a Locking Scheduler
ƒ Assumptions:
–Ts do not request locks. It is the job of the
scheduler to insert lock actions into the
stream of reads, writes, and other actions that
access data.
–Ts do not release locks. The scheduler
releases the locks when the TM tells it that
the T will commit or abort.
Concurrency Control - Cont.
ƒ A Scheduler That Inserts Lock Actions
–Two main parts of the scheduler. Here are
some steps taken by the scheduler.
Part I takes the stream of requests generated by
the Ts and inserts the right lock actions ahead of
all DB-access operations such as read, write,
increment, or update.
Part II takes the sequence of lock and DB-access
actions and executes them.
If a lock cannot be granted, then that T will be
delayed (blocked).
If all the locks are granted, then
a) If the action is a DB access, execute it.
b) If it is a lock action, then examine the lock
table and modify it if necessary.
When a T commits or aborts, the TM will notify
Part I to release all the locks held by T.
When Part II is notified that a lock on element X is
available, it determines the next T or Ts that can
now be given a lock on X and lets them proceed.
Concurrency Control - Cont.
Managing Hierarchies of DB Elements
ƒ Locks with Multiple Granularity
–In the previous discussions, we use the term
DB element X. Element X can be a relation, a
block or page, or tuple(s).
–A large X will limit concurrency, and a small X
can lead to excessive I/O and overhead for
the lock manager.
ƒ Warning Locks
–Assumption: a lock scheme with S and X locks;
warning locks are formed by prefixing I (meaning
"intention to") to the ordinary locks (e.g. IS
means intention to obtain an S lock).
–Rules of the warning protocol are
1) To place an S or X lock on any element, we
begin at the root of the hierarchy.
2) If we are at the element that we want to lock,
then request an S or X lock on that element.
Concurrency Control - Cont.
3) If the desired element is further down the hierarchy,
then we place a warning (IS or IX) at the current node.
Proceed to the appropriate child and repeat step 2
or 3 until we reach the desired node.
–Compatibility matrix for shared, exclusive,
and intention locks (row = lock held, column = lock requested):
        IS   IX   S    X
  IS    Y    Y    Y    N
  IX    Y    Y    N    N
  S     Y    N    Y    N
  X     N    N    N    N
ƒ Phantoms and Handling Insertions Correctly
–In the "intention to" locking scheme, we must
place an X lock on the whole relation to
Concurrency Control - Cont.
Concurrency Control by Timestamps
ƒ Timestamps
–Each T is assigned a unique number by the
scheduler to identify the ordering of the Ts.
Two ways of generating timestamps:
a) Use the system clock, as long as we can
guarantee that no more than one T is assigned
the same tick of the clock.
b) Use an internal counter in ascending order.
DB2, for example, uses a "log sequence number"
(LSN).
–Timestamps kept for each DB element X:
RT(X) denotes the read time of X.
WT(X) denotes the write time of X.
C(X) denotes the commit bit of X, which is true if
and only if the most recent T to write X has
already committed. It is used to prevent the "dirty
read" problem.
Concurrency Control - Cont.
ƒ Physically Unrealizable Behaviors
–Two kinds of problems when we try to
serialize based on timestamps.
1) Read too late - Transaction T tries to read X,
but a transaction U that started after T has already
written X before T can read it. Therefore we must
abort T when this scenario occurs.
2) Write too late - Transaction T tries to write X,
but a transaction U that started after T has already
read X before T can write it. This also causes a conflict.
ƒ Problems With Dirty Data
–Using the commit bit can help with these issues.
1) At time t0, U starts. At t1, U writes X. At t2, T
starts. At t3, T reads X (a dirty read). At t4, U aborts.
2) At t0, T starts. At t1, U starts. At t2, U writes
X. At t3, T writes X. At t4, T commits. At t5, U
aborts.
–When a T writes X, the value is tentative until a
commit or abort is issued.
Concurrency Control - Cont.
ƒ The Rules for Timestamp-Based Scheduling
–In response to a read or write request from a
T, the scheduler can:
Grant the request.
Abort T and restart it with a new timestamp.
Delay T and later decide whether to abort T or
grant the request.
ƒ Timestamps and Locking
–Timestamping is better when most Ts are
read-only, or it is rare that concurrent Ts will
try to read and write the same element. In
high conflict situations, locking performs
better.
Locking will frequently delay Ts as they wait for
locks, and can lead to deadlocks.
If concurrent Ts frequently read and write
elements in common, then rollbacks will be
frequent. This leads to even more delay than a
locking system.
Concurrency Control - Cont.
Concurrency Control by Validation
ƒ It allows Ts to access data without locks. The
scheduler maintains a record of what active Ts
are doing, rather than the read and write times
kept for timestamping. Before a T starts to
write element X, it goes through a "validation
phase" where the sets of elements it read and
wrote are compared with the write sets of
other active Ts. If a conflict exists, the T is rolled
back.
ƒ Validation-Based Scheduler
–The scheduler is given a read set RS(T) and
write set WS(T) for each T.
–Ts are executed in three phases:
1) Read - the T reads from the DB all the
elements in its read set. The T computes in its
local address space all the results it is going to
write.
2) Validate - it compares RS(Ti) and WS(Ti)
with those of other active Ts. If validation fails, Ti is rolled
back; otherwise it proceeds to the third phase.
Concurrency Control - Cont.
3) Write - the T writes to the DB each DB
element in its WS(T).
–To support the decision of whether to validate a
T, the scheduler maintains three sets:
START - the set of Ts that have started but not yet
completed validation. The scheduler maintains
START(T), start time of T for each T.
VAL - the set of Ts that have been validated but
not yet finished the writing of phase 3. In this set,
the scheduler maintains both the START(T) and
VAL(T).
FIN - the set of Ts that have completed phase 3.
For these Ts, the scheduler records START(T),
VAL(T), and FIN(T).
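A sketch of the validation test, simplified so that each T is a dictionary carrying its start time, read set RS, and write set WS, and "others" are the previously validated Ts; FIN of None means the other T has not finished phase 3 yet. The structure and names are illustrative.

# T fails validation if an earlier-validated transaction U that had not
# finished before T started wrote something T read, or (if U may still be
# writing) something T also intends to write.
def validate(T, others):
    for U in others:
        if U["FIN"] is None or U["FIN"] > T["START"]:
            if T["RS"] & U["WS"]:          # U may have written after T read
                return False
        if U["FIN"] is None:               # U's phase-3 writes may overlap T's
            if T["WS"] & U["WS"]:
                return False
    return True

T = {"START": 10, "RS": {"A", "B"}, "WS": {"C"}}
U = {"FIN": None, "WS": {"B"}}
print(validate(T, [U]))   # False: U may still write B, which T has read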
ƒ Concurrency-Control Mechanism Comparisons
–Storage utilization:
Locks - Space in the lock table is proportional to
the number of DB elements locked.
Timestamps - Can be stored in a table for all the
active Ts.
Concurrency Control - Cont.
Validation - Space is needed for timestamps and
read/write sets for each currently active T, plus a
few more Ts that finished after some currently
active T began.
–Performance with competing Ts.
Locking delays Ts but avoids rollbacks, even when
interaction is high. Timestamps and validation do
not delay Ts, but can cause them to roll back,
which is a more serious form of delay.
If interference is low, then neither timestamps nor
validation will cause many rollbacks, and may be
preferable to locking because they generally have
lower overhead than a locking scheduler.
When a rollback is necessary, timestamps catch
some problems earlier than validation, which
always lets a T do all its internal work before the
"validation phase" decides whether the T must
roll back or not.
More About Transaction
Management
Serializability and Recoverability
ƒ Cascading Rollback
–If T1 reads any uncommitted "dirty" data
modified by T2, when T2 aborts, T1 must also
abort as well.
–Both the timestamp-based (with commit bit)
and validation-based schedules can avoid
cascading rollback.
ƒ Recoverable Schedules
–A schedule is recoverable if each T commits
only after each T from which it has read has
committed.
–In order to use one of the logging methods,
the log's commit records must reach disk in the
order in which they are written.
ƒ Schedules That Avoid Cascading Rollback
–A schedule avoids cascading rollback (ACR)
if Ts may only read committed values.
–Every ACR schedule is recoverable.
More About Transaction
Management - Cont.
ƒ Managing Rollbacks Using Locking
–In a lock-based scheduler, a simple way to
guarantee that there are no cascading
rollbacks is Strict Locking.
Any T must not release any X locks (or other
"write" locks) until the T has either committed or
aborted, and the commit or abort log record has
been flushed to disk.
Every strict schedule is ACR.
Every strict schedule is serializable.
ƒ Group Commit
–Allows locks to be released before the
commit record on the log is actually flushed to
disk.
Do not release locks until the T finishes, and the
commit log record at least appears in a buffer.
Flush log blocks in the order that they were
created.
More About Transaction
Management - Cont.
ƒ Logical Logging
–Only the changes to the blocks are
described. Depending on the change, there
are several levels of complexity.
A small number of bytes of the DB element are
changed, e.g. update of a fixed-length field. Only
record the changed bytes and their positions.
Small change but it affects the whole record or
block, e.g. update of a variable-length field.
The change affects many bytes of a DB element,
and further changes can prevent this change from
ever being undone.
ƒ Recovery From Logical Logs
–Log Sequence Numbers (LSNs)
Each log record is given a number one greater
than that of the previous log record. Thus a typical
log record has the form <L,T,A,B> where:
More About Transaction
Management - Cont.
L is the log sequence number, an integer.
T is the transaction involved.
A is the action performed by T, e.g., "insert of
tuple t."
B is the block on which the action was
performed.
For each action, there is a compensating action
that logically undoes the action. For example, the
compensating action for "insert tuple t" is "delete
tuple t".
If a T aborts, then for each action performed on
the DB by T, the compensating action is
performed, and the fact that this action was
performed is also recorded in the log.
Each block maintains, in its header, the LSN of the
last action that affected that block.
–Suppose we need to use the logical log to
recover after a crash. Here is an outline of
the steps to take.
More About Transaction
Management - Cont.
1. Reconstruct the state of the DB at the time of
crash, including blocks whose current values were
in buffers and therefore got lost.
a) Find the most recent checkpoint on the log
and determine from it the set of Ts that were
active at that time.
b) For each log entry <L,T,A,B>, compare the
LSN N on block B with the LSN L for this log
record. If N < L, then redo action A; that action
was never performed on block B. Otherwise do
nothing.
c) For each log entry that informs us that a T
started, committed, or aborted, adjust the set of
active Ts accordingly.
2. The set of Ts that remain active when we reach
the end of the log must be aborted.
a) Scan the log again, this time from the end
back to the previous checkpoint. When we see
a record <L,T,A,B> for a T that must be aborted,
perform the compensating actions.
More About Transaction
Management - Cont.
b) If we must abort a T that began prior to the
most recent checkpoint, then continue back in
the log until the start-records for all such Ts have
been found.
c) Write abort-records in the log for each of the
Ts we had to abort.
Resolving Deadlocks
ƒ There are two broad approaches to dealing with
deadlock. We can detect deadlocks and fix
them, or we can manage Ts in such a way that
deadlocks are never able to form.
ƒ Deadlock Detection by Timeout
–When a deadlock occurs, it is very difficult to
recover from it unless one or more Ts are
rolled back.
–Timeout limits the amount of time that each T
can be active before it is rolled back.
More About Transaction
Management - Cont.
ƒ Waits-For Graph
–Used to indicate potential deadlocks.
–Each T that is waiting for or holding a lock is
represented by a node of the graph. An
arc is drawn from node A to node B if:
1) B holds a lock on X,
2) A is waiting for a lock on X, and
3) A cannot get a lock on X in its desired mode
unless B first releases its lock on X.
–If there are no cycles in the waits-for graph,
then each T can eventually complete. If there
is a cycle, then no T in the cycle can ever
make progress, therefore we have a
deadlock.
–To avoid a deadlock, one strategy is to roll
back any T that makes a request that would
cause a cycle in the waits-for graph.
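A sketch of detecting such a cycle with a depth-first search; the adjacency-list representation and names are illustrative only.

# graph[A] lists the Ts that A is waiting for; a cycle means a deadlock.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY                     # n is on the current path
        for m in graph.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True                 # back edge: a cycle (deadlock)
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

print(has_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))   # True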
More About Transaction
Management - Cont.
ƒ Deadlock Prevention by Ordering Elements
–Order the DB elements in some arbitrary but
fixed order (e.g. block addresses).
–If every T is required to request locks on
elements in order, then there can be no
deadlock due to Ts waiting for locks.
–Good idea but not practical
ƒ Detecting Deadlocks by Timestamps
–Note that this timestamp is not the same as
the timestamp used in concurrency control
where a new timestamp is generated when
the T is restarted.
–The timestamp for deadlock detection never
changes.
–If we have two transactions, T and U, one of
them is older (has the earlier timestamp).
There are two different policies that can be
used to detect deadlocks.
More About Transaction
Management - Cont.
–T is waiting for a lock that is held by U.
1) The Wait-Die Scheme:
a) If T is older than U, then T is allowed to wait
for the lock(s) held by U.
b) If U is older than T, then T "dies"; it is rolled
back.
2) The Wound-Wait Scheme:
a) If T is older than U, it "wounds" U. U must roll
back and relinquish to T the lock(s) that T needs
from U. But if U has already finished and
released its locks by the time the "wound" takes
effect, then U doesn't need to be rolled back.
b) If U is older than T, then T waits for the lock(s)
held by U.
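A compact sketch of the two policies when T requests a lock held by U; "older" means the smaller (earlier) deadlock-detection timestamp, and the function names are illustrative.

def wait_die(T_ts, U_ts):
    # Older T waits; a younger T "dies" (is rolled back).
    return "wait" if T_ts < U_ts else "roll back T (die)"

def wound_wait(T_ts, U_ts):
    # Older T "wounds" U (forces U to roll back); a younger T waits.
    return "roll back U (wound)" if T_ts < U_ts else "wait"

print(wait_die(5, 9), "|", wound_wait(5, 9))   # older T: waits | wounds U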
ƒ Comparison of DL-Management Methods
–In both wait-die and wound-wait, the older T
eventually prevails (a rolled-back T keeps its
original timestamp), thus there will be no
starvation.
More About Transaction
Management - Cont.
–Both wound-wait and wait-die are easier to
implement than a system that maintains or
periodically constructs the waits-for graph.
–Using the waits-for graph minimizes the
number of times we must abort a T because
of a deadlock. On the other hand, either
wound-wait or wait-die will sometimes roll
back a T when there was no deadlock.
Distributed Databases
ƒ Distribution of Data
–Horizontal Decomposition
The local relations are called fragments.
Visualize a single relation with its tuples separated
by horizontal lines into sets of tuples at each
location.
–Vertical Decomposition
Each location has its own relation
Logically join or union by using a view
More About Transaction
Management - Cont.
ƒ Distributed Transactions
–A single transaction can cause inserts, updates,
and deletes (I/U/D) at multiple sites.
–Issues include the local schedulers and loggers,
commit/abort decisions, and global vs. local locks.
ƒ Data Replication
–Can improve performance and reliability
using an HA (high-availability) or fail-safe setup.
ƒ Distributed Query Optimization
–Adds an extra layer of complexity when
dealing with distributed data.
Distributed Commit
ƒ Atomicity
–Many possible errors during a distributed T.
–e.g. Network problems, transactional errors
that require a multi-site rollback.
More About Transaction
Management - Cont.
ƒ Two-Phase Commit
–Makes a distributed T atomic by enforcing an
all-or-nothing commit across its local components.
–Designate the node at which the T originated as
the "coordinator node".
–Phase I
Coordinator places a log record <Prepare T> on
its local log.
It sends to each component's site the "Prepare T"
message.
Each site will determine whether to commit or
abort its component of T and send a reply back.
If a site wants to commit its component, it must
enter a "precommitted" state with a local <Ready
T> log record. Once in this state, the T cannot be
aborted unless the coordinator instructs it to do so.
If a site wants to abort its component of T, then it
logs the record <Don't commit T> and sends the
message back to the coordinator.
More About Transaction
Management - Cont.
–Phase II
If the coordinator received "Ready T" from all
components of T, then it logs <Commit T> at its
site and sends "Commit T" message to all the
sites involved in T.
If the coordinator received a "Don't commit T" from
one or more sites, it logs <Abort T> at its site and
sends "Abort T" message to all the sites involved
in T.
For each site who received a "Commit T", it
commits that component of T and writes <Commit
T> in its log.
For each site who received a "Abort T", it aborts T
and writes <Abort T> in its log.
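A sketch of the coordinator's side of two-phase commit, assuming a hypothetical participant interface with prepare(), commit(), and abort() methods and a local log reduced to an in-memory list.

def two_phase_commit(T, sites, log):
    log.append(("Prepare", T))                       # Phase I: log and ask everyone
    votes = [site.prepare(T) for site in sites]      # each site replies "ready" or "dont_commit"
    if all(v == "ready" for v in votes):             # Phase II: unanimous -> commit
        log.append(("Commit", T))
        for site in sites:
            site.commit(T)
        return "committed"
    log.append(("Abort", T))                         # any refusal -> abort everywhere
    for site in sites:
        site.abort(T)
    return "aborted"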
ƒ Recovery of Distributed Transactions
–Use the same techniques described in
Chapter 17, such as undo/redo logging, with
added communication between the sites.
–The difficult case is when the coordinator node also
fails. Then all the other sites must wait for the
coordinator to recover before continuing.
More About Transaction
Management - Cont.
Distributed Locking
ƒ Centralized Lock Systems
–A simple approach is to designate one site as
the "lock site" to maintain a lock table for all
logical elements, whether they are local or
remote.
–Cost is 3 messages per lock (request, grant,
and release).
–Can be a bottleneck and crash recovery at
the lock site can be problematic.
ƒ Distributed Locking Algorithms
–Use messages to request and release locks
to other sites or nodes.
–Once all the locks are obtained, the T will
continue as normal.
–The wait time for distributed locks is usually longer
than for local locks.
–A lock coordinator can reduce the number of
messages between all the sites.
More About Transaction
Management - Cont.
ƒ Locking Replicated Elements
–If we have replication set up, then an X lock
must be distributed as well, e.g. all replica
copies of DB element A must be locked
concurrently.
ƒ Primary-Copy Locking
–Each logical element X has a primary copy.
–All lock requests must go to the site of the
primary copy.
Long-Duration Transactions
ƒ Problems of Long Transactions
–A long-running T holds locks for a long period
of time and essentially blocks other Ts from
obtaining locks on the same DB elements. Thus other
short Ts may time out and/or be rolled back.
–There are three broad classes of applications
that involve long Ts.
More About Transaction
Management - Cont.
1) Conventional DBMS Applications - While most
DBMS applications contain short Ts, some
operations may require a few long Ts.
2) Design Systems - When there are many
components, such as the source files of a software
project, many engineers may
need to modify a common component. Thus a
check-in/check-out system is commonly used to
control such a process. If the check-out operation is
viewed as an exclusive lock, then a component can be
locked for a long time.
3) Workflow Systems - These systems involve
collections of processes, some automated
and some manual; such a process can hold an exclusive
lock on all the data involved in a T. Again, it can
possibly affect many other Ts for hours or days.
ƒ Sagas
–It is a collection of actions that form a long-duration T.
More About Transaction
Management - Cont.
–A saga consists of
1. A collection of actions.
2. A graph whose nodes are either actions or
the special Abort and Complete nodes, and
whose arcs link pairs of nodes. No arcs leave
the two special nodes, which we call terminal
nodes.
3. An indication of the node at which the action
starts, called the start node.
–Concurrency control for sagas
Each action can be viewed as a short T. A
simple locking scheme can prevent two Ts from
trying to write new values at the same time.
The overall T, A1,A2,A3,..., is managed using
"compensating transactions". Their job is to roll
back the effect of a committed action in a way that
does not depend on what has happened to the DB
between the time the action was executed and the
time the compensating T is executed.
More About Transaction
Management - Cont.
ƒ Compensating Transactions
–In a saga, each action A has a compensating
transaction, denoted A(-1). If we
execute A followed by A(-1), the result is
the same as if A had never executed.
–If D is any DB state, and B1B2...Bn is any
sequence of actions and compensating Ts
(whether from the saga in question or any
other saga or T that may legally execute on
the database), then the same DB state results
from running the sequences B1B2...Bn and
AB1B2...BnA(-1) on the DB state D.
–If a saga execution leads to the Abort node,
then we roll back the saga by executing the
compensating Ts for each executed action, in
the reverse order of those actions.
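A sketch of that rollback: run the compensating transaction of each executed action in reverse order. Actions and compensations are plain Python functions here, and all names are illustrative only.

def run_saga(actions, compensations, db):
    done = []
    try:
        for name, act in actions:
            act(db)                          # each action is a short transaction
            done.append(name)
        return "complete"
    except Exception:
        for name in reversed(done):          # undo executed actions in reverse order
            compensations[name](db)
        return "aborted"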