Transcript Document
Part III: Storage and Indexing
(Chapters 8-11)
Part V: Transctions, Concurrency control,
Scheduling, and Recovery
(Chapters 16-18)
Why Concerning Storage and Indexing?
Performance: another major factor in user satisfaction
Depends on
• Efficient data structures for data representation
• Efficiency of system operation on those structures
Disks contains data files and system files including
dictionary and index files
Disk access: one of the most critical factor in performance.
Why Not Store Everything in Main Memory?
Cost and size
Main memory is volatile: What’s the problem?
Typical storage hierarchy:
Factors: access speed, cost per unit, reliability
Cache and main memory (RAM) for currently used
data: fast but costly
Flash memory: limited number of writes (and slow),
non-volatile, disk-substitute in embedded systems
Disk for the main database (secondary storage).
Tapes for archiving older versions of the data (tertiary
storage).
Buffer Management in a DBMS
Page Requests from Higher Levels
BUFFER POOL
disk page
free frame
MAIN MEMORY
DISK
DB
choice of frame dictated
by replacement policy
CC & Recovery may require additional I/O when a
frame is chosen for replacement. Why?
Indexes
An index on a file speeds up selections on the search key
fields for the index.
Any subset of the fields of a relation can be the
search key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation).
An index contains a collection of data entries, and
supports efficient retrieval of all data entries k* with a
given key value k.
Given data entry k*, we can find record with key k
quickly.
Classes: dense/sparse index, primary/secondary,
clustered/un-clustered
Dense vs Sparse Index
Dense index: one index
entry per search key
value.
Sparse index: index
records for only some of
the records
Ashby, 25, 3000
22
Basu, 33, 4003
Bristow, 30, 2007
30
Ashby
33
Cass
Cass, 50, 5004
Smith
Daniels, 22, 6003
Jones, 40, 6003
Every sparse index is
clustered!
Sparse indexes are smaller
Which one is faster?
Which one has less
overhead?
25
40
44
Smith, 44, 3000
44
50
Tracy, 44, 5004
Sparse Index
on
Name
Data File
Dense Index
on
Age
Clustered vs. Unclustered Index
Suppose that Alternative (2) is used for data entries,
and that the data records are stored in a Heap file.
To build clustered index, first sort the Heap file (with some
free space on each page for future inserts).
Overflow pages may be needed for inserts. (Thus, order of
data recs is `close to’, but not identical to, the sort order.)
CLUSTERED
Index entries
direct search for
data entries
Data entries
UNCLUSTERED
Data entries
(Index File)
(Data file)
Data Records
Data Records
Index Trees
As for any index, 3 alternatives for data entries k*:
Data record with key value k
<k, rid of data record with search key value k>
<k, list of rids of data records with search key k>
Choice is orthogonal to the indexing technique
used to locate data entries k*.
Tree-structured indexing techniques support
both range searches and equality searches.
ISAM (indexed sequential access method): static
structure (data entries reside in leaf pages and
overflow pages)
B+ tree: dynamic, adjusts gracefully under
inserts and deletes.
ISAM
index entry
P
0
K
1
P
1
K 2
P
2
K m
Pm
Index file may still be quite large. But we can
apply the idea repeatedly – making a tree
Non-leaf
Pages
Leaf
Pages
Overflow
page
Leaf pages contain data entries.
Primary pages
Example ISAM Tree
Each node can hold 2 entries; no need for `next-leaf-page’
pointers. Why?
Sequential allocation of leaf pages.
Insert 23.
Insert 48, 41, 42.
Root
40
10*
15*
20
33
20*
27*
51
33*
37*
40*
46*
51*
63
55*
63*
97*
After Inserting 23*, 48*, 41*, 42* ...
Root
40
Index
Pages
20
33
20*
27*
51
63
Primary
Leaf
10*
15*
33*
37*
40*
46*
48*
41*
Pages
Overflow
23*
Pages
42*
51*
55*
63*
97*
B+ Tree: Most Widely Used Index
Insert/delete at log F N cost; keep tree heightbalanced. F = fanout , N = # leaf pages;
F is typically >> 2. Why?
Minimum 50% occupancy (except for root). Each
node contains d <= m <= 2d entries. The parameter
d is called the order of the tree.
Supports equality and range-searches efficiently.
Index Entries
(Direct search)
Data Entries
("Sequence set")
Extendible Hashing
Situation: Bucket (primary page) becomes full.
Why not re-organize file by doubling # of buckets?
Reading and writing all pages for reorganizing a data
file is expensive
Idea: Use directory of pointers to buckets, double # of
buckets by doubling the directory, splitting just the bucket
that overflowed!
Directory much smaller than file, so doubling it is much
cheaper. Only one page of data entries is split. No
overflow page!
Trick lies in how hash function is adjusted!
LOCAL DEPTH
Example
GLOBAL DEPTH
2
Directory is array of size 4.
To find bucket for r, take
last `global depth’ # bits of
h(r); we denote r by h(r).
If h(r) = 5 = binary 101,
it is in bucket pointed to
by 01.
00
2
4* 12* 32* 16*
Bucket A
2
1*
5* 21* 13*
Bucket B
01
10
2
11
10*
DIRECTORY
Bucket C
2
15* 7* 19*
Bucket D
DATA PAGES
Insert: If bucket is full, split it (allocate new page, re-distribute).
If necessary, double the directory. (When is splitting a
bucket does not require doubling? We can tell by
comparing global depth with local depth for the split bucket.)
Insert h(r)=20 (Causes Doubling)
--- How to determine keys in A and A2?
LOCAL DEPTH
2
32*16*
GLOBAL DEPTH
2
00
Bucket A
3
32* 16* Bucket A
GLOBAL DEPTH
2
3
1* 5* 21*13* Bucket B
01
000
2
1* 5* 21* 13* Bucket B
001
10
2
11
10*
Bucket C
15* 7* 19*
Bucket D
2
011
10*
Bucket C
101
2
110
15* 7* 19*
Bucket D
111
2
4* 12* 20*
010
100
2
DIRECTORY
LOCAL DEPTH
Bucket A2
(`split image'
of Bucket A)
3
DIRECTORY
4* 12* 20*
Bucket A2
(`split image'
of Bucket A)
Linear Hashing
This is another dynamic hashing scheme, an
alternative to Extendible Hashing.
LH handles the problem of long overflow chains
without using a directory, and handles duplicates.
Idea: Use a family of hash functions h0, h1, h2, ...
hi(key) = h(key) mod(2iN); N = initial # buckets
h is some hash function (range is not 0 to N-1)
If N = 2d0, for some d0, hi consists of applying h and looking
at the last di bits, where di = d0 + i.
hi+1 doubles the range of hi (similar to directory doubling)
Overview of LH File
In the middle of a round.
Bucket to be split
Next
Buckets that existed at the
beginning of this round:
this is the range of
Buckets split in this round:
If h Level ( search key value )
is in this range, must use
h Level+1 ( search key value )
to decide if entry is in
`split image' bucket.
hLevel
`split image' buckets:
created (through splitting
of other buckets) in this round
Lawyers
Recently reported in the Massachusetts Bar Association Lawyers
Journal, the following are questions actually asked of witnesses
by attorneys during trials:
"Now doctor, isn't it true that when a person dies in his sleep,
he doesn't know about it until the next morning?“
"Were you alone or by yourself?“
"Was it you or your younger brother who was killed in the war?“
"How far apart were the vehicles at the time of the collision?“
Q: "How was your first marriage terminated?"
A: "By death."
Q: "And by whose death was it terminated?"
Part V:
Transactions, Concurrency control,
Scheduling, and Recovery
Ch. 16 - 18
Overview
Transactions and ACID properties
Serial execution and serializable execution
Serializability (dependency) graph
Serializability theorem
Conflict equivalence and view equivalence
Properties of schedules: SR, RC, ACA, ST
Schedulers
3 options to handle the request from TM
optimistic (aggressive) vs pessimistic (conservative)
2PL and Strict 2PL
Timestamp ordering and Strict TO
Atomicity of Transactions
A transaction might commit after completing all its
actions, or it could abort (or be aborted by the DBMS)
after executing some actions.
A very important property guaranteed by the DBMS
for all transactions is that they are atomic.
Atomicity: a transaction is assumed to be executing
all its actions in one step, or not executing any
actions at all.
Not easy to achieve. Why?
Consistency, Isolation, Durability
A transaction executed in isolation must preserve
DB consistency.
Even if multiple transactions executed
concurrently, each should be unaware of other
transactions are being executed concurrently.
When a transaction complete successfully, the
changes it made must persist, even with failures
afterwards.
Conflicts and Equivalence
When do two operations conflict?
They are issued by different transactions
They operate on the same data object
At least one of them is a write operation
Conflict equivalent
Two executions are conflict equivalent if in both
executions, all conflicting operations have the same
order.
Serializability
Correctness criterion
Serializability is the correctness definition of DB
All serializable schedules are equally correct
Scheduling algorithms enforce certain ordering
In distributed DBMS, variable delays may disturb any
particular ordering which is supposed to occur
Serialization graph (dependency graph) shows dependency
relationship among transactions
Serialization Theorem
For a schedule H, if SG(H) is acyclic, then H is
serializable.
Properties of Schedules
Recoverability
To ensure that aborting a transaction does not change
the semantics of committed transactions
w1(x)r2(x)w2(y)C2
Is it recoverable? What if T1 aborts?
Recoverable execution depends on commit order
A transaction cannot commit until all values it read are
guaranteed not to be aborted. How to do it?
Delaying commit: T2 cannot commit until T1 commits
Cascaded abort is sometimes necessary. Why?
w1(x)r2(x)w2(x)A1
Properties of Schedules
Recoverability
Cascaded abort is sometimes necessary
w1(x)r2(x)w2(x)A1
Avoiding cascaded aborts
Achieved if every transaction reads only the values
written by committed transactions
Must delay each r(x) until all transactions that issued
w(x) is either committed or aborted
w1(x) ….. C1 r2(x) w2(y) …
Properties of Schedules
Restoring before images
Implementing transaction abort by simply restoring
before images of all writes is very convenient
w0(x)w1(x)w2(x) A1 A2
Value of x must be restored to the initial value, not the
value written by T1
Solution: delay w(x) until all transactions that have
written x are either committed or aborted
Strictness
Executions that satisfy both requirements
Delay both r(x) and w(x) until all transactions that have
written w(x) are either committed or aborted
r1(x)w1(x) w1(y) w2(z) w2(x) C1 --- is it strict?
Relationships among Properties
Recoverability (RC)
RC if Ti reads from Tj and Ci is in H, then Ci follows
Cj
Avoiding cascaded aborts (ACA)
ACA if Ti reads x from Tj then ri(x) follows Cj
Strictness (ST)
ST if whenever Oi(x) follows wj(x), then Oi(x) follows
either Aj or Cj
What is the relationship among ST, ACA, and RC?
ST < ACA < RC
What about with SR and Serial execution?
Two-Phase Locking (2PL)
Two-Phase Locking Protocol
Each Xact must obtain a S (shared) lock on object
before reading, and an X (exclusive) lock on object
before writing.
A transaction can not request additional locks
once it releases any locks.
If an Xact holds an X lock on an object, no other
Xact can get a lock (S or X) on that object.
Multiple-Granularity Locks
Why consider it?
Database consists of tables, pages, tuples (records)
Hard to decide what granularity to lock (tuples vs.
pages vs. tables).
Shouldn’t have to decide. How?
Data “containers” are nested:
Database
contains
Tables
Pages
Tuples
The Phantom Problem
T1 implicitly assumes that it has locked the set of all
sailor records with rating = 1.
Assumption only holds if no sailor records are
added while T1 is executing!
Why did this problem happen?
Example shows that conflict serializability guarantees
serializability only if the set of objects is fixed!
Need some mechanism to enforce this assumption -index locking and predicate locking
Index Locking
Index
r=1
Data
If there is a dense index on the rating field using
Alternative (2), T1 should lock the index page
containing the data entries with rating = 1.
If there are no records with rating = 1, T1 must
lock the index page where such a data entry would
be, if it existed!
What if there is no suitable index?
T1 must lock all pages, and lock the file/table to
prevent new pages from being added, to ensure
that no new records with rating = 1 are added.
Predicate Locking
Grant lock on all records that satisfy some logical
predicate, e.g. age > 2*salary.
Index locking is a special case of predicate locking for
which an index supports efficient implementation of the
predicate lock.
What is the predicate in the sailor example?
rating=1
Why not using predicate locks in commercial DBMS?
In general, predicate locking has a significant locking
overhead.
Locking in B+ Trees
How can we efficiently lock a particular leaf node?
Don’t confuse this with multiple granularity locking
-- How are they different?
One solution: Ignore the tree structure, just lock
pages while traversing the tree, following 2PL -What’s wrong?
This has terrible performance!
Root node (and many higher level nodes) becomes a
bottleneck. Why?
Because every tree access begins at the root.
B+ Tree Locking
Higher levels of the tree only direct searches
for leaf pages.
For inserts, a node on a path from root to
modified leaf must be locked (in X mode, of
course), only if a split can propagate up to it
from the modified leaf. (Similar point holds
w.r.t. deletes.)
We can exploit these observations to design
efficient locking protocols that guarantee
serializability even though they violate 2PL.
A Simple Tree Locking Algorithm
Search: Start at root and go down; repeatedly,
S lock child then unlock parent.
Insert/Delete: Start at root and go down,
obtaining X locks as needed. Once child is
locked, check if it is safe:
If child is safe, release all locks on ancestors.
Safe node: Node such that changes will not
propagate up beyond this node.
When is a node safe for inserts?
• Node is not full.
When is a node safe for deletes?
• Node is not half-empty.
Timestamp Ordering
Idea: Any conflicting operations are executed
in their timestamp order
Simple and aggressive
Schedule immediately and reject requests
that arrive too late
How do you know a request has arrived too
late?
Give each object a read-timestamp (RTS) and a
write-timestamp (WTS), give each transaction a
timestamp (TS) when it begins
Timestamp Ordering
Timestamp ordering rule:
If Oi(x) and Oj(x) are conflicting operation,
Oi(x) is processed before Oj(x), if and only if
TS(Ti) < TS(Tj).
Request arriving too late:
Oi(x) arrives after the scheduler has sent
conflicting operation Oj(x) with TS(Tj) > TS(Ti)
Basic Timestamp Ordering
Ri(x): if TS(Ti) < WTS(x), reject it;
otherwise (TS(Ti) >=WTS(x)), then schedule it and
set RTS(x) to max (RTS(x), TS(Ti))
Wi(x): if TS(Ti) < RTS(x) or TS(Ti) <WTS(x), reject it;
otherwise (TS(Ti) >=WTS(x)), then schedule it and
set WTS(x) to max (WTS(x), TS(Ti))
When restarted, Ti is assigned a new timestamp
Thomas Write Rule:
For wi(x), if TS(Ti) < WTS(x) and TS(Ti) >= RTS(x),
then wi(x) can be ignored, rather than being rejected.
Why is it correct?
Ignoring obsolete write
Exercise: Non-equivalence of 2PL and TO
1.
2.
H1=r2(x) w3(x) C3 w1(y)C1 r2(y) w2(y) C2
Is this schedule possible to timestamp ordering?
Is it possible with 2PL?
H1 is legal with strict timestamp ordering. What
is the equivalent serial schedule?
T1 T2 T3
It is not possible with 2PL
T2 must release lock on x for T3, but then gets
lock on y – violation of two-phaseness
Relationship between 2PL and TO
Schedules generated by 2PL and TO
They are all correct (serializable)
They are not the same set: H1 shows that
Is the relationship inclusive?
S {schedules by 2PL} subset of S {schedules by TO}?
S {schedules by TO} subset of S {schedules by 2PL}?
Consider w3(x) C3 w2(x) C2 r1(x)
Is it legal with TO?
Is it legal with 2PL?
Two sets of schedules are intersecting, but subset
2PL
TO
SR
Failure and Recovery
Failure and consistency
Principle of recovery
Transaction failures
System failures
Media failures
Redundancy
Database can be protected by ensuring that its correct
state can be reconstructed from information stored
redundantly in the system
Recovering database – restart operation
Bringing the stable DB to a consistent state by removing
effects of uncommitted transactions and applying missing
effects of committed transactions.
Recovery and Restart
Types of storage media
Volatile storage: fast , but not surviving system failures
Non-volatile storage
Stable storage: information never lost (practically)
Recovery
1)
2)
Ideally, stable DB should contain, for each data item, the
last value written by committed transaction
Practically, stable DB may contain values written by
uncommitted transactions, or may not contain the last
committed values.
Why?
Updating of uncommitted T
Buffering of committed values in the cache
Function of Recovery Manager
Atomicity:
Transactions may abort (“Rollback”).
Durability:
What if DBMS stops running? (Causes?)
Desired Behavior after
system restarts:
– T1, T2 & T3 should be
durable.
– T4 & T5 should be
aborted (effects not seen).
T1
T2
T3
T4
T5
crash!
Recovery Management
Design rules for recovery manager
Restart activity
Undo rule: committed values must be saved before
overwritten by uncommitted values in the stable DB
Redo rule: before commit, new values it wrote must be in
the stable storage (DB or log)
Preparation: during normal operation
Actual recovery: after failure
Preparation
Logging
Checkpointing
Cache Manager
Two operations: fetch and flush
When to flush?
Use dirty bit for deciding flushing operation
Flush: if the slot in cache is not dirty, do nothing;
otherwise, copy the value into stable storage
Fetch: select a slot, using replacement algorithm if full
(and flush if necessary), copy the value into slot, reset
dirty bit, update cache directory
Depends on recovery strategy of the system
Different recovery algorithms use different strategies
Idempotence of restart
Any sequence of incomplete execution, followed by a
complete execution of restart has the same effect of just
one complete execution
Handling the Memory Pool
Write to disk: force/no-force
Cache page: steal/no-steal
Force every write to disk?
Poor response time.
But provides durability.
No Steal
Force
Steal buffer-pool frames from
uncommited Xacts?
No Force
If not, poor throughput.
If so, how can we ensure
atomicity?
Steal
Trivial
Desired
More on Steal and Force
STEAL (why enforcing Atomicity is hard)
To steal frame F: Current page in F (say P) is written
to disk; some Xact holds lock on P.
• What if the Xact with the lock on P aborts?
• Must remember the old value of P at steal time
(to support UNDOing the write to page P).
NO FORCE (why enforcing Durability is hard)
What if system crashes before a modified page is
written to disk?
Write as little as possible, in a convenient place, at
commit time, to support REDOing modifications.
Recovery Algorithms
Undo/redo algorithm
Comparison with other recovery algorithms
Most complicated of the four recovery algorithms
Flexible in deciding when to flush (no-force)
Maximize efficiency during normal operation at the
expense of less efficient recovery
Issues: disk I/O, log space, recovery time
No-redo requires more frequent flush (force)
Uncommitted transaction is allowed to replace dirty slot
for in-place update – undo might be necessary
Restart procedure
Process log forward and backward for redo and undo
Undo/Redo Recovery
A transaction T writes vale V to data object X.
What will happen?
System fetches X if it is not already in cache
Record V in the log and in X’s slot C
No need for the cache manager to flush C
If cache manager replaces C (steal), and either T aborts
or system fails before T commits, undo is required
If T commits and system fails before C is flushed (no
force), redo is required
Restart Procedure for Undo/Redo Recovery
1.
2.
3.
4.
Discard all cache slots
Scan the log to analyze which transactions committed,
aborted, or active, to determine data for redo/undo
Redo all actions that were committed but not recorded
in the stable DB
Undo all actions of transactions that were aborted or
active at the time of failure