Log Tuning - GDM@Fudan
Download
Report
Transcript Log Tuning - GDM@Fudan
Log Tuning
© Dennis Shasha, Philippe Bonnet 2001
DB Architecture
Storage Subsystem
Concurrency Control
Recovery
Buffer Manager
Operating System
Hardware
[Processor(s), Disk(s), Memory]
© Dennis Shasha, Philippe Bonnet 2001
Review: The ACID properties
• Question: which properties of ACID is
related to crash recovery?
– Atomicity & Durability (and also used for
Consistency-related rollbacks)
Atomicity and Durability
COMMITTED
COMMIT
Ø
ACTIVE
BEGIN
TRANS
(running, waiting)
ROLLBACK
ABORTED
• Every transaction either
commits or aborts. It
cannot change its mind
• Even in the face of
failures:
– Effects of committed
transactions should be
permanent;
– Effects of aborted
transactions should leave no
trace.
© Dennis Shasha, Philippe Bonnet 2001
Outages
• Hardware
• Environment
– Fire in the machine room
(Credit Lyonnais, 1996)
• Operations
– Problem during regular
system administration,
configuration and operation.
• Maintenance
– Problem during system
repair and maintenance
– Fault in the physical
devices: CPU, RAM, disks,
network.
• Software
– 99% are Heisenbugs:
transient software error
related to timing or
overload. (software failures
that occurs only once but
cause system to stop)
– Heisenbugs do not appear
when the system is restarted.
© Dennis Shasha, Philippe Bonnet 2001
Outages
• A fault tolerant system
must provision for all
causes of outages (see
case studies)
• Software is the problem
From J.Gray and A.Reuters
Transaction Processing: Concepts
and Techniques
Software
Hardware
Maintenance
Operations
Environment
Unknown
– Hardware failures cause
under 10% of outages
– Heisenbugs stop the system
without damaging the data.
• Database systems protect
integrity against single
hardware failure and some
software failures.
© Dennis Shasha, Philippe Bonnet 2001
Motivation
• Atomicity:
– Transactions may abort
(“Rollback”).
• Durability:
– What if DBMS stops
running? (Causes?)
• Desired Behavior
aftersystem restarts:
– T1, T2 & T3 should be
durable.
– T4 & T5 should be aborted
(effects not seen).
© Dennis Shasha, Philippe Bonnet 2001
A simple non-logging
scheme :Assumptions
• Concurrency control is in effect.
• Updates are happening “in place”.
– i.e. data is overwritten on (deleted from) the
disk.
• A simple scheme to guarantee Atomicity &
Durability?
© Dennis Shasha, Philippe Bonnet 2001
Buffer Mgmt Plays a Key Role
• Force policy – make sure that every update is
on disk before commit.
– Provides durability without REDO logging.
– But, can cause poor performance.
• No Steal policy – no UNCOMMITED updates
are written to disk.
– Useful for ensuring atomicity without UNDO
logging.
– But can cause poor performance.
Handling the Buffer Pool
• When a Xact submit,
force every write to disk?
– Poor response time.
– But provides durability.
No Steal
Force
• When the buffer is full,
can we Steal buffer-pool
frames from
No Force
uncommited Xacts?
– If not, poor throughput.
– If so, how can we ensure
atomicity?
© Dennis Shasha, Philippe Bonnet 2001
Steal
Trivial
Desired
More on Steal and Force
• STEAL (why enforcing Atomicity is hard)
– To steal frame F: Current page in F (say P) is written to disk;
some Xact holds lock on P.
– What if the Xact with the lock on P aborts?
– Must remember the old value of P at steal time (to support
UNDOing the write to page P).
• NO FORCE (why enforcing Durability is hard)
– What if system crashes before a modified page iswritten to
disk?
– Write as little as possible, in a convenient place, at commit
time,to support REDOing modifications.
© Dennis Shasha, Philippe Bonnet 2001
Logging
• To enable REDO/UNDO: record each update in a log.
• Log: An ordered list of REDO/UNDO actions
• What should a log record contain?
–
–
–
–
–
a unique identifier – LSN (log sequence number)
Who makes the action? – XID (Transaction ID: )
Where does the action happen? – pageID, offset, length
What is changed? – old data, new data
and some control info (we will see later)
• Note: Only those log records written on disk could be used for
recovery after crash!
• REDO and UNDO information in a log.
– Sequential writes to log (put it on a separate disk).
– Minimal info written to log, so multiple updates fit in a single log page.
Current database state = current state of data on disks + log
© Dennis Shasha, Philippe Bonnet 2001
Write-Ahead Logging (WAL)
The Write-Ahead Logging Protocol:
Must force the log record for an update before the
corresponding data page gets to disk.
Guarantees Atomicity
Must write all log records for a Xact before commit.
Guarantees Durability
ARIES algorithms, developed by C.Mohan at IBM
Almaden in the early 90’s
http://www.almaden.ibm.com/u/mohan/ARIES_Impact.html
© Dennis Shasha, Philippe Bonnet 2001
WAL & the Log
• Each log record has a unique Log Sequence Number (LSN).
– LSNs always increasing.
• Each data page contains a pageLSN.
– The LSN of the most recent log record for an update to that page.
• System keeps track of flushedLSN.
– The max LSN flushed so far.
• WAL: Before a page is written,
– pageLSN ≤ flushedLSN
© Dennis Shasha, Philippe Bonnet 2001
Log Records
• Possible log record types:
–
–
–
–
Update
Commit
Abort
End (signifies end of commit or abort)
• Compensation Log Records (CLRs)
– for UNDO actions
© Dennis Shasha, Philippe Bonnet 2001
Other Log-Related State
• Transaction Table:
– One entry per active Xact.
– Contains XID, status (running/commited/aborted), and lastLSN.
• Dirty Page Table:
– Dirty page in the buffer: pages have been changed but not yet reflect on
disk
– One entry per dirty page in buffer pool.
– Contains recLSN -- the LSN of the log record which first caused the
page to be dirty.
© Dennis Shasha, Philippe Bonnet 2001
Checkpointing
• Periodically, the DBMS creates a checkpoint, in order to
minimize the time taken to recover in the event of a
system crash. Write to log:
– begin_checkpoint record: Indicates when chkpt began.
– end_checkpoint record: Contains current Xact table and
dirty page table. This is a `fuzzy checkpoint’:
• Other Xacts continue to run; so these tables accurate only as of the
time of the begin_checkpoint record.
• No attempt to force dirty pages to disk; effectiveness of
checkpoint limited by oldest unwritten change to a dirty page. (So
it’s a good idea to periodically flush dirty pages to disk!)
• Store LSN of chkpt record in a safe place (master
record).
© Dennis Shasha, Philippe Bonnet 2001
The Big Picture:
What’s Stored Where
© Dennis Shasha, Philippe Bonnet 2001
Crash Recovery: Big Picture
• Start from a checkpoint
(found via master
record).
• Three phases. Need to:
– Figure out which Xacts
committed since
checkpoint, which failed
(Analysis).
– REDO all actions.
(repeat history)
– UNDO effects of failed
Xacts.
© Dennis Shasha, Philippe Bonnet 2001
Aries for steal, no-force
approach
• There are 3 phases in the Aries recovery algorithm:
– Analysis: Scan the log forward (from the most recent checkpoint)
to identify all Xacts that were active, and all dirty pages in the
buffer pool at the time of the crash.
– Redo: Redoes all updates to dirty pages in the buffer pool, as
needed, to ensure that all logged updates are in fact carried out and
written to disk.
– Undo: The writes of all Xacts that were active at the crash are
undone (by restoring the before value of the update, which is in the
log record for the update), working backwards in the log. (Some
care must be taken to handle the case of a crash occurring during
the recovery process!)
© Dennis Shasha, Philippe Bonnet 2001
example
• Get master record
from disk, which is
LSN 00 , i.e., the
begin checkpoint
record.
© Dennis Shasha, Philippe Bonnet 2001
Analysis
• T. T.: (T1,80),and (T3,60).
• D.P.T.: (P1,20), (P2,30), (P3,40), and (P5,80).
© Dennis Shasha, Philippe Bonnet 2001
Redo
•
Redo starts from LSN20 (minimum recLSN in DPT).
–
LSN 20
•
–
LSN 30
•
–
No action
LSN 80
•
–
Redo the changes on P2
LSN 70
•
–
No action
LSN 60
•
–
Redo the change in P3
LSN 50
•
–
Redo the change in P2
LSN 40
•
–
Check whether P1 has pageLSN more than 10 or not. Since it is a committed transaction, we
probably need not redo this update.
Redo the changes on P5
LSN 90
•
No action
© Dennis Shasha, Philippe Bonnet 2001
Undo
• ToUndo consists of (80, 60).
• LSN 80
–
Undo the changes in P5. Append a CLR: Undo T1 LSN 80, set
undonextLSN = 20. Add 20 to ToUndo.
• ToUndo consists of (60, 20).
• LSN 60
– Undo the changes on P2. Append a CLR: Undo T3 LSN 60, set
undonextLSN = 40. Add 40 to ToUndo.
• ToUndo consists of (40, 20).
• LSN 40
– Undo the changes on P3. Append a CLR: Undo T3 LSN 40, T3 end
• ToUndo consists of (20).
• LSN 20
– Undo the changes on P1. Append a CLR: Undo T1 LSN 20, T1 end
© Dennis Shasha, Philippe Bonnet 2001
The after recovery
© Dennis Shasha, Philippe Bonnet 2001
What if the system crashes during
recovery
• The recovery algorithm should be run again, however,it
could be run more efficiently.
• Crash during the REDO phase:
– Some updates need not be redone if they are written to disk at the
last REDO phase.
– Note: pageLSN stored at each page indicates the most recent log
record for an update to that page
• Crash during the UNDO phase:
–
–
–
–
– Similar to crash during Xact aborts
– All the actions of the last UNDO phase should be reapplied
– the REDO phase will redo the CLRs
– then UNDO phase will continue with the undo tasks left
UNSTABLE STORAGE
DATABASE BUFFER
LOG BUFFER
lri lrj
Pi
WRITE
log records before commit
LOG
DATA
RECOVERY
Pj
WRITE
modified pages after commit
DATA
DATA
STABLE STORAGE
© Dennis Shasha, Philippe Bonnet 2001
Logging in SQL Server 2000
Log entries:
- LSN
- before and after images or
logical log
Free
Log caches
DB2 UDB v7 uses a similar scheme
Current
Flush
Log caches Log caches
DATABASE BUFFER
Waiting
processes
db
writer
Flush
Log caches
free Pi
Lazywriter
Flush queue
Synchronous I/O
free Pj
Asynchronous I/O
DATA
LOG
© Dennis Shasha, Philippe Bonnet 2001
Logging in Oracle 8i
Before images
Free list
Rollback segments
(fixed size)
Log buffer (default 32 Kb)
After images
(redo entries)
DATABASE BUFFER
Pi
Pj
LGWR
(log writer)
DBWR
(database writer)
Log File
#1
LOG
Log File
#2
Rollback
Segments
© Dennis Shasha, Philippe Bonnet 2001
DATA
Put the Log on a Separate Disk
• Writes to log occur sequentially
• Writes to disk occur (at least) 100 times
faster when they occur sequentially than
when they occur randomly
A disk that has the log should have no other data
+ sequential I/O
+ log failure independent of database failure
© Dennis Shasha, Philippe Bonnet 2001
Put the Log on a Separate Disk
• 300 000 transactions. Each
contains an insert or
update statement.
– Oracle 9i on linux
• 30 % performance
improvement if log is
located on a different disk
© Dennis Shasha, Philippe Bonnet 2001
Throughput (tuples/sec)
Group Commits
• 300 000 transactions. Each
contains an insert
statement.
350
– DB2 UDB v7.1
300
• Group Commits: Log
250
200
150
100
50
0
1
25
Size of Group Commit
records of many transactions
are written together
– Increases throughput by
reducing the number of
writes
– at the cost of increased
mean response time, since
locks will be held longer.
© Dennis Shasha, Philippe Bonnet 2001
Tuning Database Writes
• Dirty data is written to disk
– When the number of dirty pages is greater than a
given parameter (Oracle 8)
– When the number of dirty pages crosses a given
threshold (less than 3% of free pages in the
database buffer for SQL Server 7)
– When a checkpoint is performed
• Trade-off between online performance and recoverability
• At regular intervals
• When the log is full (Oracle 8).
2 - Tuning the Guts
© Dennis Shasha, Philippe Bonnet 2001
33
Tune Checkpoint Intervals
• A checkpoint (partial flush
of dirty pages to disk)
occurs at regular intervals or
when the log is full:
Throughput Ratio
1.2
1
0.8
– Impacts the performance of
on-line processing
+ Reduces time to recover from
a crash
0.6
0.4
0.2
0
0 checkpoint
4 checkpoints
• 300 000 transactions. Each
contains an insert statement.
– Oracle 8i for Windows 2000
© Dennis Shasha, Philippe Bonnet 2001
Reduce the Size of
Large Update Transactions
• Consider an update-intensive batch transaction
(concurrent access is not an issue):
It can be broken up in short transactions
(mini-batch):
+ Easy to recover
+ Does not overfill the log buffers
Example: Transaction that updates, in sorted order, all accounts that had
activity on them, in a given day.
Break-up to mini-batches each of which access 10,000 accounts and
then updates a global counter.
© Dennis Shasha, Philippe Bonnet 2001