Chapter 17: Recovery System
Failure Classification
Storage Structure
Recovery and Atomicity
Log-Based Recovery
Shadow Paging
Recovery With Concurrent Transactions
Buffer Management
Failure with Loss of Nonvolatile Storage
Advanced Recovery Techniques
ARIES Recovery Algorithm
Remote Backup Systems
Database Backup
Database backup and recovery constitute important functions of a DBMS
Backups are scheduled automatically to a backup storage device; a backup may be:
A full backup or dump of the database
A differential backup containing only the modifications made since the previous backup copy
Only the transaction logs
Backups are stored in secure sites and on stable storage devices
Recovery Concepts
Database recovery is the process of restoring the database to a correct
/ consistent state in the event of failure.
Recovery process ensures the atomicity and durability properties of
transaction
There are many different types of failure, each of which must be handled appropriately.
Among the causes of failure are:
System crashes due to hardware or software errors
Media failures
Natural disasters such as fire, flood, etc.
Human error on the part of operators or users, and application software errors
Sabotage
Whatever the cause of the failure, there are two main effects that must be considered:
The loss of main memory, including database buffers
The loss of the disk copy of the database
The recovery scheme must also provide high availability, i.e. minimize
the time for which the database is not usable after the crash.
Failure Classification
Transaction failure:
Logical errors: the transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock)
System crash: a power failure or other hardware or software failure causes the system to crash (loss of main memory)
Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by a system crash
Database systems have numerous integrity checks to prevent corruption of disk data
Disk failure / media failure: a head crash or similar disk failure destroys all or part of disk storage
Destruction is assumed to be detectable: disk drives use checksums to detect failures
Hardware failures: memory errors, bad disk sectors, disk-full errors
Software failures: failures related to the OS, DBMS, etc.
Application software errors: logical errors in programs that access the database
Network failures: can occur when using client-server configurations or distributed database systems where multiple database systems are connected by a common network
Natural physical disasters: fire, floods, earthquakes, etc.
Human errors: carelessness, unintentional destruction of data by operators, users, etc.
Sabotage
Non-Catastrophic Failures
A computer failure (system crash): a hardware failure, a software failure, or a network failure
A transaction error: integer overflow, division by zero, a logical error, a user interruption, or an exception condition
Concurrency control enforcement: violated serializability or deadlock
Recovery Algorithms
Recovery algorithms are techniques to ensure
database consistency and transaction atomicity and
durability despite failures
Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
Recovery Classification
The recovery process first determines the type and extent of recovery required: the entire database, or only the unstable committed records
According to the type of failure, recovery procedures are classified into:
Recovery from a catastrophic failure (such as a disk crash), and
Recovery from a non-catastrophic failure
Recovery from a catastrophic failure is based on restoring a database backup copy and redoing the operations of committed transactions (stored in an archived log file) up to the time of the failure
Recovery Classification (continued)
If a database becomes inconsistent due to a non-catastrophic failure, the strategy is to reverse only those changes that made the database inconsistent
This is accomplished by undoing (and sometimes also redoing) some operations; a log file is used here
From now on we consider only recovery from non-catastrophic (non-disk-crash) failures (we assume the data on disk is safe)
Recovery from non-catastrophic failures can be based on several algorithms, such as:
Log-based recovery
Shadow paging
Storage Structure
Volatile storage:
does not survive system crashes
examples: main memory, cache memory
access to volatile storage is extremely fast
Nonvolatile storage:
survives system crashes
examples: disk (online storage), flash memory, magnetic tape (archival storage), optical disk
access to nonvolatile storage is slower than access to volatile storage by several orders of magnitude
Stable storage:
a mythical form of storage that survives all failures
approximated by maintaining multiple copies on distinct nonvolatile media
Stable-Storage Implementation
To implement stable storage we need to replicate the needed
information in several non-volatile storage media.
Store archival backups on tapes
Maintain multiple copies of each block
on separate disks at same location (Disk mirroring, RAID level 1)
Maintain copies at remote sites to protect against disasters such as fire
or flooding.
Failure during data transfer can still result in inconsistent copies:
Block transfer between memory and disk can result in
Successful completion
Partial failure: destination block has incorrect information
Total failure: destination block was never updated
If a data transfer failure occurs, the recovery procedure must detect it and restore the block to a consistent state.
Data Access
The database is partitioned into fixed-length storage units called blocks. Blocks are the unit of transfer between memory and disk.
Physical blocks are the blocks residing on the disk.
Buffer blocks are the blocks residing temporarily in main memory.
The area of memory where the blocks reside temporarily is called the disk buffer.
Block movements between disk and main memory are initiated through the following two operations:
input(B) transfers the physical block B to main memory.
output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.
Each transaction Ti has its private work area in which local copies of all data items accessed and updated by it are kept.
Ti's local copy of a data item X is called xi.
We assume, for simplicity, that each data item fits in, and is stored inside, a single block.
Data Access (Cont.)
A transaction transfers data items between system buffer blocks and its private work area using the following operations:
read(X) assigns the value of data item X to the local variable xi.
write(X) assigns the value of local variable xi to data item X in the buffer block.
Both these commands may necessitate the issue of an input(BX) instruction before the assignment, if the block BX in which X resides is not already in memory.
Transactions:
perform read(X) while accessing X for the first time;
all subsequent accesses are to the local copy;
after the last access, the transaction executes write(X).
output(BX) need not immediately follow write(X). The system can perform the output (or force-output) operation when it deems fit.
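For concreteness, here is a minimal Python sketch of the model just described. All names (input_block, output_block, the block identifiers) are illustrative, not a real DBMS API; the point is only to show input/output moving whole blocks between disk and the buffer, and read/write moving single items between a buffer block and a transaction's private work area.

```python
# Sketch of the data-access model: input(B)/output(B) move blocks, read(X)/write(X) move items.

disk = {"BX": {"X": 100}, "BY": {"Y": 200}}    # physical blocks on disk
buffer = {}                                    # buffer blocks in main memory

def input_block(block_id):
    """input(B): transfer physical block B from disk into the buffer."""
    buffer[block_id] = dict(disk[block_id])

def output_block(block_id):
    """output(B): write buffer block B back to disk, replacing the physical block."""
    disk[block_id] = dict(buffer[block_id])

class Transaction:
    def __init__(self, name):
        self.name = name
        self.work_area = {}                    # local copies xi of data items

    def read(self, item, block_id):
        """read(X): copy data item X from its buffer block into the local work area."""
        if block_id not in buffer:             # issue input(BX) if the block is not resident
            input_block(block_id)
        self.work_area[item] = buffer[block_id][item]

    def write(self, item, block_id):
        """write(X): copy the local value xi back into the buffer block (not yet to disk)."""
        if block_id not in buffer:
            input_block(block_id)
        buffer[block_id][item] = self.work_area[item]

t1 = Transaction("T1")
t1.read("X", "BX")           # first access: brings BX into the buffer
t1.work_area["X"] -= 50      # all further computation uses the local copy
t1.write("X", "BX")          # after the last access, write back to the buffer block
# output_block("BX") may be deferred; the system forces it when it deems fit
```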
Example of Data Access
[Figure: buffer blocks A and B in main memory hold data items X and Y; input(A) and output(B) move blocks between disk and the buffer; read(X) and write(Y) move values between the buffer blocks and the private work areas of T1 (x1, y1) and T2 (x2).]
Recovery and Atomicity
Modifying the database without ensuring that the
transaction will commit may leave the database
in an inconsistent state.
Consider transaction Ti that transfers $50 from
account A to account B; goal is either to perform
all database modifications made by Ti or none at
all.
Several output operations may be required for Ti (to output A and B). A failure may occur after one of these modifications has been made but before all of them are made.
Recovery and Atomicity (Cont.)
To ensure atomicity despite failures, we first
output information describing the
modifications to stable storage without
modifying the database itself.
We study two approaches:
log-based recovery, and
shadow-paging
We assume (initially) that transactions run
serially, that is, one after the other.
Log Based Recovery
In case of failure, a transaction must be aborted or
committed to maintain data integrity
Transaction Log plays an important role in database
recovery
Transaction is the basic unit of recovery
A transaction begins with <T, Begin> and ends with
<Commit>
Two types of recovery
Forward Recovery
Backward Recovery
Forward Recovery (REDO)
Forward recovery is also called Roll Forward
Used in case of physical damage, for example a disk crash, or failures while writing data to database buffers or while flushing buffers to secondary storage.
Intermediate results of a transaction are stored in database buffers; from the buffers they are transferred to secondary storage.
Flushing of the buffers may be triggered by the COMMIT operation of a transaction, or automatically when a buffer becomes full.
If a failure occurs between writing to the buffers and flushing the buffers to secondary storage, the recovery manager determines the status of the transactions that performed WRITE operations at the time of the failure.
If COMMIT was already issued, the recovery manager will redo (roll forward) so that the updates are written to the database.
Forward recovery guarantees the durability property of transactions.
Forward Recovery (REDO)
To recreate the database, the roll-forward program combines the last database copy with the transaction log to produce the recreated database:
1. Read the most recent copy of the database
2. Read the transaction log
3. The roll-forward program reads the log entries, starting from the first and continuing up to the last entry
4. For each log entry, it changes the data value in the database to the 'after' value
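A small Python sketch of this roll-forward idea, under the assumption that each log entry carries the transaction, the item, and its before and after values (the log format and names are illustrative):

```python
# Illustrative roll-forward (REDO): restore the last backup copy, then apply
# the 'after' image of every logged write, from the first entry to the last.

def roll_forward(last_db_copy, transaction_log):
    db = dict(last_db_copy)                      # steps 1-2: read backup copy and the log
    for op, txn, item, before, after in transaction_log:   # step 3: first to last entry
        if op == "write":
            db[item] = after                     # step 4: set each item to its 'after' value
    return db                                    # the recreated database

backup = {"A": 1000, "B": 2000}
log = [("write", "T0", "A", 1000, 950), ("write", "T0", "B", 2000, 2050)]
print(roll_forward(backup, log))                 # {'A': 950, 'B': 2050}
```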
Backward Recovery (UNDO)
Backward Recovery is also called Roll-Backward
Used in case an error occurs in the midst of normal operation of the database.
The error could be a human keying in wrong values, or a program ending abnormally and leaving some of the changes it was supposed to make to the database incomplete.
If a transaction had not committed at the time of failure, it causes inconsistency in the database.
The recovery manager must undo (roll backward) any effects of the transaction on the database.
Backward recovery guarantees the atomicity property of transactions.
Backward Recovery (UNDO)
To recreate the database, the roll-backward program combines the current database with the transaction log to produce the corrected database:
1. Read the database in its current state
2. Read the transaction log and position it at the last entry
3. Read backward and reset each updated value to its 'before image' until you reach the point where the error occurred
4. The roll-backward program 'undoes' each transaction in reverse order
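The complementary Python sketch for roll-backward, again with an illustrative log format: the log is scanned in reverse and each updated item is reset to its 'before' image for the failed transaction.

```python
# Illustrative roll-backward (UNDO): starting from the current database state,
# scan the log backwards and reset each updated item to its 'before' image.

def roll_backward(current_db, transaction_log, failed_txn):
    db = dict(current_db)                        # step 1: database in its current state
    for op, txn, item, before, after in reversed(transaction_log):   # steps 2-4
        if op == "write" and txn == failed_txn:
            db[item] = before                    # reset to the 'before' image
    return db                                    # the corrected database

current = {"A": 950, "B": 2050}
log = [("write", "T1", "A", 1000, 950), ("write", "T1", "B", 2000, 2050)]
print(roll_backward(current, log, "T1"))         # {'A': 1000, 'B': 2000}
```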
Log-Based Recovery
To be able to recover from failures, the DBMS maintains a log file
A log is kept on stable storage.
The log is a sequence of log records, and maintains a record of update activities on the
database.
Typically, a log file contains records with following contents:
[start_transaction, T ]
[write_item, T, X, old_value, new_value]
[read_item,T, X ]
[commit, T ]
[abort, T ]
When transaction Ti starts, it registers itself by writing a <Ti start> log record
Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the value of X before the write, and V2 is the value to be written to X.
The log record notes that Ti has performed a write on data item Xj; Xj had value V1 before the write, and will have value V2 after the write.
When Ti finishes its last statement, the log record <Ti commit> is written.
Two approaches using logs
Deferred database modification
Immediate database modification
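For concreteness, the log record types listed above could be represented as follows. This is a Python sketch; the field names are illustrative and do not correspond to any particular DBMS's on-disk format.

```python
# Sketch of the log record types listed above.
from dataclasses import dataclass

@dataclass
class StartRecord:            # [start_transaction, T]  /  <Ti start>
    txn: str

@dataclass
class WriteRecord:            # [write_item, T, X, old_value, new_value]  /  <Ti, X, V1, V2>
    txn: str
    item: str
    old_value: object
    new_value: object

@dataclass
class CommitRecord:           # [commit, T]  /  <Ti commit>
    txn: str

@dataclass
class AbortRecord:            # [abort, T]
    txn: str

# A tiny log for a transaction T0 that writes A from 1000 to 950:
log = [StartRecord("T0"), WriteRecord("T0", "A", 1000, 950), CommitRecord("T0")]
```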
Deferred Update
The idea behind deferred update is to postpone updates to the database until the transaction reaches its COMMIT command
During execution, updates are recorded in a log file and in cache buffers holding database pages (all in RAM)
When the COMMIT is reached, all log updates are first written to the log file on disk, and then the transaction commits
After that, the corresponding updates are written from the buffers to the database
Deferred Update Layout
[Figure: the transaction program writes database updates to database buffers and log records to a log buffer, both in main memory; the log buffer is written to the log file on disk before the actual COMMIT, while the database buffers are written to the database on disk after the actual COMMIT.]
Deferred Update (continued)
If a transaction fails before reaching COMMIT, there is no need to make any recovery
Redo of operations is needed if a system crash occurs after COMMIT, but before all changes are recorded in the database on disk
In that case the operations have to be redone from the log file (which is already on disk) to the database
Redo is done using after images, i.e. the item values intended to be written to the database
A deferred-update recovery log file therefore has to contain only after images, the new database item values
Deferred Database Modification
Transaction starts by writing <Ti start> record to
log.
A write(X) operation results in a log record <Ti, X,
V> being written, where V is the new value for X
Note: old value is not needed for this scheme
The write is not performed on X at this time, but is
deferred.
When Ti partially commits, <Ti commit> is written
to the log
Finally, the log records are read and used to
actually execute the previously deferred writes.
Deferred Database Modification (Cont.)
During recovery after a crash, a transaction needs to be redone if and
only if both <Ti start> and<Ti commit> are there in the log.
Redoing a transaction Ti ( redoTi) sets the value of all data items
updated by the transaction to the new values.
Crashes can occur while
the transaction is executing the original updates, or
while recovery action is being taken
example transactions T0 and T1 (T0 executes before T1):
T0: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B)
T1: read(C); C := C - 100; write(C)
Deferred Database Modification (Cont.)
Below we show the log as it appears at three instants of time.
If the log on stable storage at the time of the crash is as in case:
(a) No redo actions need to be taken
(b) redo(T0) must be performed since <T0 commit> is present
(c) redo(T0) must be performed followed by redo(T1) since <T0 commit> and <T1 commit> are present
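A minimal Python sketch of deferred-modification recovery: a transaction is redone if and only if both its start and commit records appear in the log, and only new values are ever needed (log format and names are illustrative).

```python
# Deferred-modification recovery sketch: redo only transactions whose
# <Ti start> and <Ti commit> both appear in the log; entries carry only new values.

def recover_deferred(db, log):
    committed = {txn for kind, txn, *rest in log if kind == "commit"}
    for kind, txn, *rest in log:
        if kind == "write" and txn in committed:
            item, new_value = rest
            db[item] = new_value                 # redo: set item to its new value
    return db

db = {"A": 1000, "B": 2000, "C": 700}
log = [
    ("start", "T0"), ("write", "T0", "A", 950), ("write", "T0", "B", 2050), ("commit", "T0"),
    ("start", "T1"), ("write", "T1", "C", 600),      # no <T1 commit>: T1 is ignored
]
print(recover_deferred(db, log))                 # {'A': 950, 'B': 2050, 'C': 700}
```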
Immediate Database Modification
The main idea of immediate update:
The immediate database modification scheme allows database
updates of an uncommitted transaction to be made as the writes are
issued
since undoing may be needed, update logs must have both old value
and new value
Update log record must be written before database item is written
When a transaction issues an update command, before and after images
are recorded into a log file on disk, and (thereupon) the database can
(but does not have to be) immediately updated
We assume that the log record is output directly to stable storage
Can be extended to postpone log record output, so long as prior to execution of an output(B) operation for a data block B, all log records corresponding to items in B have been flushed to stable storage
Output of updated blocks can take place at any time before or after
transaction commit
Order in which blocks are output can be different from the order in
which they are written.
Immediate Update Layout
[Figure: the transaction program writes database updates to database buffers and log records to a log buffer, both in main memory; the log records are written to the log file on disk immediately after each update, while the database buffers may be written to the database on disk immediately or later.]
Immediate Database Modification (Cont.)
Recovery procedure has two operations instead of one:
undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti
redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti
Both operations must be idempotent
That is, even if the operation is executed multiple times the effect is the same as if it is executed once
Needed since operations may get re-executed during recovery
When recovering after failure:
Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the record <Ti commit>.
Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>.
Undo operations are performed first, then redo operations.
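A Python sketch of this rule: undo every transaction that has a start record but no commit record (scanning backwards, restoring old values), then redo every committed transaction (scanning forwards, applying new values). Both passes are idempotent.

```python
# Immediate-modification recovery sketch: undo uncommitted transactions first
# (backwards, restoring old values), then redo committed ones (forwards, new values).

def recover_immediate(db, log):
    started = {txn for kind, txn, *rest in log if kind == "start"}
    committed = {txn for kind, txn, *rest in log if kind == "commit"}
    # Undo pass: backwards over the log.
    for kind, txn, *rest in reversed(log):
        if kind == "write" and txn in started and txn not in committed:
            item, old, new = rest
            db[item] = old
    # Redo pass: forwards over the log.
    for kind, txn, *rest in log:
        if kind == "write" and txn in committed:
            item, old, new = rest
            db[item] = new
    return db

db = {"A": 900, "B": 2100, "C": 600}             # illustrative state on disk at crash time
log = [
    ("start", "T0"), ("write", "T0", "A", 1000, 950), ("write", "T0", "B", 2000, 2050),
    ("commit", "T0"),
    ("start", "T1"), ("write", "T1", "C", 700, 600),   # T1 never committed
]
print(recover_immediate(db, log))                # {'A': 950, 'B': 2050, 'C': 700}
```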
Immediate DB Modification Recovery Example
Below we show the log as it appears at three instants of time.
Recovery actions in each case above are:
(a) undo (T0): B is restored to 2000 and A to 1000.
(b) undo (T1) and redo (T0): C is restored to 700, and then A and B are
set to 950 and 2050 respectively.
(c) redo (T0) and redo (T1): A and B are set to 950 and 2050
respectively. Then C is set to 600
Checkpoints
Some DBMS use CHECKPOINT records in the log file to
prevent unnecessary redo operations
To take a checkpoint, the DBMS has to:
temporarily suspend execution of all transactions,
force-write the results of all update operations of committed transactions from main-memory buffers to disk,
write a checkpoint record into the log file and force-write the log to disk,
resume executing transactions
Only changes made by transactions that committed between the last checkpoint and a system failure have to be redone
Checkpoints (Cont.)
During recovery we need to consider only the most recent
transaction Ti that started before the checkpoint, and
transactions that started after Ti.
1. Scan backwards from the end of the log to find the most recent <checkpoint> record
2. Continue scanning backwards till a record <Ti start> is found
3. Need only consider the part of the log following that start record. The earlier part of the log can be ignored during recovery, and can be erased whenever desired
4. For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti). (Done only in case of immediate modification.)
5. Scanning forward in the log, for all transactions starting from Ti or later with a <Ti commit>, execute redo(Ti)
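The five steps above, sketched in Python under the same serial-execution assumption (the log format and names are illustrative):

```python
# Checkpoint-based recovery sketch (serial transactions, immediate modification).
# Log records: ("start", T), ("write", T, item, old, new), ("commit", T), ("checkpoint",).

def recover_with_checkpoint(db, log):
    # Step 1: scan backwards for the most recent <checkpoint> record.
    cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    # Step 2: continue backwards until a <Ti start> record is found.
    starts = [i for i in range(cp) if log[i][0] == "start"]
    tail = log[starts[-1]:] if starts else log   # step 3: only this part of the log matters
    committed = {rec[1] for rec in tail if rec[0] == "commit"}
    # Step 4: undo transactions with no <Ti commit> (immediate modification only).
    for rec in reversed(tail):
        if rec[0] == "write" and rec[1] not in committed:
            db[rec[2]] = rec[3]                  # restore old value
    # Step 5: scan forward, redoing transactions that have a <Ti commit>.
    for rec in tail:
        if rec[0] == "write" and rec[1] in committed:
            db[rec[2]] = rec[4]                  # apply new value
    return db

log = [("start", "T1"), ("write", "T1", "A", 1, 2), ("commit", "T1"),
       ("checkpoint",),
       ("start", "T2"), ("write", "T2", "B", 5, 6), ("commit", "T2"),
       ("start", "T3"), ("write", "T3", "C", 8, 9)]           # T3 never committed
print(recover_with_checkpoint({"A": 2, "B": 5, "C": 9}, log))  # {'A': 2, 'B': 6, 'C': 8}
```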
Example of Checkpoints
[Figure: a timeline with a checkpoint at time Tc and a system failure at time Tf; T1 completes before the checkpoint, T2 and T3 commit after the checkpoint but before the failure, and T4 is still active at the failure.]
T1 can be ignored (its updates were already output to disk due to the checkpoint)
T2 and T3 are redone
T4 is undone
Shadow Paging
Shadow paging is an alternative to log-based recovery; this
scheme is useful if transactions execute serially
Idea: maintain two page tables during the lifetime of a transaction –
the current page table, and the shadow page table
Store the shadow page table in nonvolatile storage, such that state
of the database prior to transaction execution may be recovered.
Shadow page table is never modified during execution
To start with, both the page tables are identical. Only current page
table is used for data item accesses during execution of the
transaction.
Whenever any page is about to be written for the first time
A copy of this page is made onto an unused page.
The current page table is then made to point to the copy
The update is performed on the copy
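A toy Python sketch of this copy-on-write idea: the shadow page table is left untouched, and the current page table is redirected to a fresh copy the first time each page is written. The data structures are illustrative; a real implementation works with on-disk pages and an atomic pointer update.

```python
# Toy shadow-paging sketch: pages live in a list; the shadow page table is never
# modified, and the current table points to a fresh copy of each page on first write.

pages = ["page0 data", "page1 data", "page2 data", None, None]   # None = unused slots
shadow_table = [0, 1, 2]            # logical page i -> physical page shadow_table[i]
current_table = list(shadow_table)  # initially identical to the shadow table

def write_page(logical, new_data):
    phys = current_table[logical]
    if phys == shadow_table[logical]:            # first write to this page?
        free = pages.index(None)                 # copy it to an unused physical page
        pages[free] = pages[phys]
        current_table[logical] = free            # current table now points at the copy
        phys = free
    pages[phys] = new_data                       # the update is performed on the copy

def commit():
    # Steps 1-2: flush modified pages and the current page table to disk (elided here).
    # Step 3: atomically switch the pointer so the current table becomes the shadow table.
    global shadow_table
    shadow_table = list(current_table)

write_page(1, "new page1 data")
# Before commit(), the shadow table still reaches the old "page1 data";
# after commit(), the database state includes the new version.
```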
Sample Page Table
Example of Shadow Paging
Shadow and current page tables after write to page 4
Shadow Paging (Cont.)
To commit a transaction :
1. Flush all modified pages in main memory to disk
2. Output current page table to disk
3. Make the current page table the new shadow page table, as
follows:
keep a pointer to the shadow page table at a fixed (known) location
on disk.
to make the current page table the new shadow page table, simply
update the pointer to point to current page table on disk
Once pointer to shadow page table has been written, transaction
is committed.
No recovery is needed after a crash — new transactions can
start right away, using the shadow page table.
Pages not pointed to from current/shadow page table should be
freed (garbage collected).
Shadow Paging (Cont.)
Advantages of shadow paging over log-based schemes:
no overhead of writing log records
recovery is trivial
Disadvantages:
Copying the entire page table is very expensive
Can be reduced by using a page table structured like a B+-tree: no need to copy the entire tree, only the paths in the tree that lead to updated leaf nodes
Commit overhead is high even with the above extension
Need to flush every updated page, and the page table
Data gets fragmented (related pages get separated on disk)
After every transaction completion, the database pages containing old versions of modified data need to be garbage collected
Hard to extend the algorithm to allow transactions to run concurrently
Easier to extend log-based schemes
Recovery With Concurrent Transactions
We modify the log-based recovery schemes to allow multiple transactions to execute concurrently.
All transactions share a single disk buffer and a single log
A buffer block can have data items updated by one or more transactions
We assume concurrency control using strict two-phase locking; i.e. the updates of uncommitted transactions should not be visible to other transactions
Otherwise, how would we perform undo if T1 updates A, then T2 updates A and commits, and finally T1 has to abort?
Logging is done as described earlier.
Log records of different transactions may be interspersed in the log.
The checkpointing technique and the actions taken on recovery have to be changed
since several transactions may be active when a checkpoint is performed.
Recovery With Concurrent Transactions
(Cont.)
Checkpoints are performed as before, except that the checkpoint log
record is now of the form
< checkpoint L>
where L is the list of transactions active at the time of the checkpoint
We assume no updates are in progress while the checkpoint is carried
out
When the system recovers from a crash, it first does the following:
1. Initialize undo-list and redo-list to empty
2. Scan the log backwards from the end, stopping when the first <checkpoint L> record is found. For each record found during the backward scan:
if the record is <Ti commit>, add Ti to redo-list
if the record is <Ti start>, then if Ti is not in redo-list, add Ti to undo-list
3. For every Ti in L, if Ti is not in redo-list, add Ti to undo-list
Recovery With Concurrent Transactions
(Cont.)
At this point undo-list consists of incomplete transactions
which must be undone, and redo-list consists of finished
transactions that must be redone.
Recovery now continues as follows:
1. Scan the log backwards from the most recent record, stopping when <Ti start> records have been encountered for every Ti in undo-list. During the scan, perform undo for each log record that belongs to a transaction in undo-list
2. Locate the most recent <checkpoint L> record
3. Scan the log forwards from the <checkpoint L> record till the end of the log. During the scan, perform redo for each log record that belongs to a transaction on redo-list
Example of Recovery
Go over the steps of the recovery algorithm on the
following log:
<T0 start>
<T0, A, 0, 10>
<T0 commit>
<T1 start>
/* Scan at step 1 comes up to here */
<T1, B, 0, 10>
<T2 start>
<T2, C, 0, 10>
<T2, C, 10, 20>
<checkpoint {T1, T2}>
<T3 start>
<T3, A, 10, 20>
<T3, D, 0, 10>
<T3 commit>
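A Python sketch of these scans, run on the example log above (the log format is simplified; the comments mark where the algorithm's steps apply):

```python
# Recovery with concurrent transactions, on the example log above.
# Records: ("start", T), ("write", T, item, old, new), ("commit", T), ("checkpoint", L).

log = [
    ("start", "T0"), ("write", "T0", "A", 0, 10), ("commit", "T0"),
    ("start", "T1"),                                   # step-1 undo scan stops here
    ("write", "T1", "B", 0, 10),
    ("start", "T2"), ("write", "T2", "C", 0, 10), ("write", "T2", "C", 10, 20),
    ("checkpoint", ["T1", "T2"]),
    ("start", "T3"), ("write", "T3", "A", 10, 20), ("write", "T3", "D", 0, 10),
    ("commit", "T3"),
]

# Build undo-list and redo-list by scanning backwards to the first <checkpoint L>.
undo_list, redo_list = set(), set()
for rec in reversed(log):
    if rec[0] == "commit":
        redo_list.add(rec[1])
    elif rec[0] == "start" and rec[1] not in redo_list:
        undo_list.add(rec[1])
    elif rec[0] == "checkpoint":
        undo_list |= {t for t in rec[1] if t not in redo_list}
        checkpoint_pos = log.index(rec)
        break

db = {"A": 20, "B": 10, "C": 20, "D": 10}              # illustrative state at crash time
# 1. Scan backwards, undoing records of transactions in undo-list,
#    until a <Ti start> has been seen for every Ti in undo-list.
pending = set(undo_list)
for rec in reversed(log):
    if not pending:
        break
    if rec[0] == "write" and rec[1] in undo_list:
        db[rec[2]] = rec[3]                             # restore old value
    elif rec[0] == "start" and rec[1] in pending:
        pending.discard(rec[1])
# 2-3. Scan forwards from the checkpoint, redoing records of transactions in redo-list.
for rec in log[checkpoint_pos:]:
    if rec[0] == "write" and rec[1] in redo_list:
        db[rec[2]] = rec[4]                             # apply new value

print(undo_list, redo_list)   # undo-list = {T1, T2}, redo-list = {T3}; T0 committed before the checkpoint
print(db)                     # {'A': 20, 'B': 0, 'C': 0, 'D': 10}
```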
Log Record Buffering
Log record buffering: log records are buffered in
main memory, instead of being output directly to
stable storage.
Log records are output to stable storage when a block of
log records in the buffer is full, or a log force operation is
executed.
Log force is performed to commit a transaction by
forcing all its log records (including the commit
record) to stable storage.
Several log records can thus be output using a
single output operation, reducing the I/O cost.
Log Record Buffering (Cont.)
The rules below must be followed if log records are
buffered:
Log records are output to stable storage in the order in
which they are created.
Transaction Ti enters the commit state only when the log
record
<Ti commit> has been output to stable storage.
Before a block of data in main memory is output to the
database, all log records pertaining to data in that block
must have been output to stable storage.
This rule is called the write-ahead logging or WAL rule
Database Buffering
The database maintains an in-memory buffer of data blocks
When a new block is needed, if the buffer is full an existing block needs to be removed from the buffer
If the block chosen for removal has been updated, it must be output to disk
If a block with uncommitted updates is output to disk, log records with undo information for the updates are output to the log on stable storage first
(write-ahead logging)
No updates should be in progress on a block when it is output to disk. This can be ensured as follows:
Before writing a data item, the transaction acquires an exclusive lock on the block containing the data item
The lock can be released once the write is completed.
Such locks held for short duration are called latches.
Before a block is output to disk, the system acquires an exclusive latch on the block
This ensures no update can be in progress on the block
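A Python sketch of how a buffer manager might enforce the WAL rule on eviction: before a dirty block is output, all buffered log records for that block are forced to stable storage, and the block is held under an exclusive latch so no update is in progress during the output. The class and field names are illustrative, not a real buffer-manager API.

```python
# Sketch of buffer eviction obeying the write-ahead logging (WAL) rule:
# force the log for a dirty block before the block itself is output to disk.
import threading

stable_log = []                       # log records already on stable storage
log_buffer = []                       # log records still in main memory

def log_force(up_to_lsn):
    """Output buffered log records (in creation order) up to the given LSN."""
    global log_buffer
    stable_log.extend(r for r in log_buffer if r["lsn"] <= up_to_lsn)
    log_buffer = [r for r in log_buffer if r["lsn"] > up_to_lsn]

class BufferBlock:
    def __init__(self, block_id, data):
        self.block_id, self.data = block_id, data
        self.dirty = False
        self.last_lsn = 0             # LSN of the latest log record for this block
        self.latch = threading.Lock() # short-duration exclusive latch

    def output(self, disk):
        with self.latch:              # no update may be in progress during output
            if self.dirty:
                log_force(self.last_lsn)   # WAL: log records for this block go out first
            disk[self.block_id] = dict(self.data)
            self.dirty = False
```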
Buffer Management (Cont.)
Database buffer can be implemented either
in an area of real main-memory reserved for the database,
or
in virtual memory
Implementing buffer in reserved main-memory has
drawbacks:
Memory is partitioned before-hand between database
buffer and applications, limiting flexibility.
Needs may change, and although operating system knows
best how memory should be divided up at any time, it
cannot change the partitioning of memory.
Buffer Management (Cont.)
Database buffers are generally implemented in virtual
memory in spite of some drawbacks:
When operating system needs to evict a page that has been
modified, the page is written to swap space on disk.
When database decides to write buffer page to disk, buffer
page may be in swap space, and may have to be read from
swap space on disk and output to the database on disk,
resulting in extra I/O!
Known as dual paging problem.
Ideally when the OS needs to evict a page from the buffer, it should pass control to the database, which in turn should:
1. Output the page to the database instead of to swap space (making sure to output log records first), if the page is modified
2. Release the page from the buffer, for the OS to use
Dual paging can thus be avoided, but common operating systems do not support such functionality.
Failure with Loss of Nonvolatile Storage
So far we assumed no loss of non-volatile storage
Technique similar to checkpointing used to deal with loss
of non-volatile storage
Periodically dump the entire content of the database to stable
storage
No transaction may be active during the dump procedure; a
procedure similar to checkpointing must take place
Output all log records currently residing in main memory onto stable
storage.
Output all buffer blocks onto the disk.
Copy the contents of the database to stable storage.
Output a record <dump> to log on stable storage.
Recovering from Failure of Non-Volatile
Storage
To recover from disk failure
restore database from most recent dump.
Consult the log and redo all transactions that
committed after the dump
The dump of the database is also referred
to as archival dump
Remote Backup Systems
Remote backup systems provide high
availability by allowing transaction processing
to continue even if the primary site is
destroyed.
Remote Backup Systems (Cont.)
Detection of failure: Backup site must detect when primary site
has failed
to distinguish primary site failure from link failure maintain several
communication links between the primary and the remote backup.
Heart-beat messages
Transfer of control:
To take over control, the backup site first performs recovery using its copy of the database and all the log records it has received from the primary.
Thus, completed transactions are redone and incomplete transactions
are rolled back.
When the backup site takes over processing it becomes the new
primary
To transfer control back to old primary when it recovers, old primary
must receive redo logs from the old backup and apply all updates
locally.
Remote Backup Systems (Cont.)
Time to recover: To reduce delay in takeover, backup site
periodically processes the redo log records (in effect, performing
recovery from previous database state), performs a checkpoint,
and can then delete earlier parts of the log.
Hot-Spare configuration permits very fast takeover:
Backup continually processes redo log record as they arrive,
applying the updates locally.
When failure of the primary is detected the backup rolls back
incomplete transactions, and is ready to process new transactions.
Alternative to remote backup: distributed database with
replicated data
Remote backup is faster and cheaper, but less tolerant to failure
Remote Backup Systems (Cont.)
Time to commit
Ensure durability of updates by delaying transaction commit until
update is logged at backup; avoid this delay by permitting lower
degrees of durability.
One-safe: commit as soon as transaction’s commit log record is
written at primary
Two-very-safe: commit when transaction’s commit log record is
written at primary and backup
Problem: updates may not arrive at backup before it takes over.
Reduces availability since transactions cannot commit if either site
fails.
Two-safe: proceed as in two-very-safe if both primary and backup are active. If only the primary is active, the transaction commits as soon as its commit log record is written at the primary.
Better availability than two-very-safe; avoids problem of lost
transactions in one-safe.
Database Mirroring : How it works
[Figure: database mirroring with a principal SQL Server, a mirror SQL Server, and a witness. An application commits at the principal; log records are written at the principal and shipped to the mirror, which is always redoing them and so remains current; the commit is then acknowledged back to the application.]
End of Chapter