Files and Buffer - Microsoft Research
Download
Report
Transcript Files and Buffer - Microsoft Research
Files and Buffer Manager
9:00
11:00
13:30
15:30
18:00
Aug. 2
Intro &
terminology
Reliability
Fault
tolerance
Transaction
models
Reception
Aug. 3
Aug. 4
Aug. 5
Aug. 6
TP mons
Logging &
Files &
Structured
& ORBs
res. Mgr.
Buffer Mgr.
files
Locking Res. Mgr. &
COM+
Access paths
theory
Trans. Mgr.
Locking
CICS & TP
CORBA/
Groupware
techniques & Internet
EJB + TP
Queueing
Advanced
Replication Performance
Trans. Mgr.
& TPC
Workflow Cyberbricks
Party
FREE
Chapter 15
Abstractions Provided by the File
Manager
Device independence: The file manager turns the large
variety of external storage devices, such as disks (with
their different numbers of cylinders, tracks, arms, and
read/write heads), ram-disks, tapes, and so on, into
simple abstract data types.
Allocation independence: The file manager does its
own space management for storing the data objects
presented by the client. It may store the same objects in
more than one place (replication).
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
2
Abstractions Provided by the File
Manager
Address independence: Whereas objects in main
memory are always accessed through their addresses,
the file manager provides mechanisms for associative
access. Thus, for example, the client can request
access to all records with a specified value in some
field of the record. Support for associative access
comes in many flavors, from simple mechanisms
yielding fast retrieval via the primary key up to the
expressive power of the SQL select statement.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
3
External Storage vs. Main Memory
Capacity: Main memory is usually limited to a size that
is some orders of magnitude smaller than what large
databases need.
Economics: External storage holds large volumes of
data at reasonable cost.
Durability: Main memory is volatile. External storage
devices such as magnetic or optical disks are
inherently durable and therefore are appropriate for
storing persistent objects. After a crash, recovery starts
with what is found in durable storage.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
4
External Storage vs. Main Memory
Speed: External storage devices are some orders of
magnitude slower than main memory. As a result, it is
more costly, both in terms of latency and in terms of
pathlength, to get data from external storage to the CPU
than to load data from main memory.
Functionality: Data cannot be processed directly on
external storage: they can neither be compared nor
modified “out there.”
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
5
The Storage Pyramid
current data
main
memory
stale
data
online
external
storage
Electronic RAM
and bulk
storage
Magnetic
/ optical
disks
near line
(archive)
storage
Automated archives
(e.g. optical disk
jukeboxes, tape
robots, etc.)
typical capacity
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
6
Interfacing to External Memory:
Read-Write Mapping
External Storage
File A
File C
File D
File B
read object w
read object x
/write object x
read object y
Main Memory
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
7
Interfacing to External Memory:
File Mapping
External Storage
File A
File C
File D
File B
map File C
unmap File C
Main Memory
Jim Gray, Andreas Reuter
File C
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
8
Interfacing to External Memory:
Single-Level Storage
External Storage
File A
File C
File D
File B
Virtual
memory
File A
File B
File C
File D
Explicit mapping
Main Memory
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
9
Locality and Cacheing
The movement of data through the pyramid is guided by
the principle of locality:
Locality of active data: Data that have recently been
referenced will very likely be referenced again.
Locality of passive data: Data that have not been
referenced recently will most likely not be referenced in
the future.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
10
Levels of Abstraction in a File and
Database Manager
Application
Transaction
programs
sort, join,...
setoriented
DBMS
access
Application
database
access
read, write modules
tuple
oriented
access
main
memory
manages
Tuple
management,
associative
access
logging
recovery
databaseb
uffer mgr.
Buffer management
media
and file
manager
File management
archive
manager
Archive management
block
oriented
access
online
external
memory
nearline
external
memory
Jim Gray, Andreas Reuter
manages
device
oriented
access
manages
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
11
Operations of the Basic File System
STATUS
STATUS
STATUS
STATUS
STATUS
STATUS
STATUS
create(filename, allocparmp)
delete(filename)
open(filename, ACCESSMODE, FILEID);
close(FILEID)
extend(FILEID, allocparmp)
read(FILEID, BLOCKID, BLOCKP)
readc(FILEID, BLOCKID, blockcount,
BLOCKP)
STATUS write(FILEID, BLOCKID, BLOCKP)
STATUS writec(FILEID, BLOCKID, blockcount,
BLOCKP)
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
12
Mapping Files To Disk
File D
File E
File C
Disk
1
File B
The
File A
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
denote the
address
mapping
between
disk and
files.
WICS August 2 - 6, 1999
13
Issues in Managing Disk Space
Initial allocation: When a file is created, how
many contiguous slots should be allocated to
it?
Incremental expansion: If an existing file grows
beyond the number of slots currently allocated,
how many additional contiguous blocks should
be assigned to that file?
Reorganization: When and how should the
free space on the disk be reorganized?
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
14
Extent-Based Allocation
file directory
disk-id
extent index
accum-length
primary
Disk-A
14
100
A
Jim Gray, Andreas Reuter
1.secondary 2.secondary 3.secondary
Disk-A
Disk-A
Disk-B
187
3
214
350
600
850
extent directory
Transaction Processing - Concepts and Techniques
B
WICS August 2 - 6, 1999
15
Buddy Systems
Slots (shaded extents are free)
0
0
0
0
0
0
0
1
0
0
1
0
0
0
1
1
0
1
0
0
0
1
0
1
0
1
1
0
0
1
1
1
1
0
0
0
1
0
0
1
1
0
1
0
1
0
1
1
1
1
0
0
1
1
0
1
1
1
1
0
1
1
1
1
0
1
2
3
Buddy types
Type
Free
extents
Jim Gray, Andreas Reuter
0
1
2
3
0110
0010
1100
Transaction Processing - Concepts and Techniques
1011
WICS August 2 - 6, 1999
16
Simple Mapping of Relations To Disks
File system
segments
real disks
extents
relations
file system & operating system
Jim Gray, Andreas Reuter
database system
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
17
A Usual Way of Mapping of Relations
To Disks
relations
logical disks OS-files
real disks
extents
segments
tabelspaces
operating system
Jim Gray, Andreas Reuter
database system
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
18
Principles of the Database Buffer
process of access
module
Give me
page P
bufferfix (P , ...)
return
frame
address F
directory
1
5
buffer is accessible
from the caller's
process
(shared memory)
3
determine
FILEID and
block number
Jim Gray, Andreas Reuter
2 find frame in
buffer
readdirect
4
buffer
manager
interface
buffer
storage
area
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
19
Design Options for the Buffer
Manager
Buffer per file: Each file has its own private buffer pool..
Buffer per page size: In systems with different page
(and block) sizes, there is usually at least one buffer for
each page size.
Buffer per file type: There are files like indices, which
are accessed in a significantly different way from other
files. Therefore, some systems dedicate buffers to files
depending on the access pattern and try to manage
each of them in a way that is optimal for the respective
file organization.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
20
Logic of the Buffer Manager
Search in buffer: Check if the requested page is
in the buffer. If found, return the address F of
this frame to the caller.
Find free frame: If the page is not in the buffer,
find a frame that holds no valid page.
Determine replacement victim: If no such frame
exists, determine a page that can be removed
from the buffer (in order to reuse its frame).
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
21
Logic of the Buffer Manager
Write modified page: If replacement page has
been changed, write it.
Establish frame address: Denote the start
address of the frame as F.
Determine block address: Translate the
requested PAGEID P into a FILEID and a block
number. Read the block into the frame selected.
Return: Return the frame address F to the
caller.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
22
Synchronization in the Buffer
process A
process B
bufferfix (P ,...)
bufferfix (P ,...)
process A
process B
change P to P'
change P to P"
rewrite page
give copy
to caller
rewrite page
give copy
to caller
?
a) Access module in
process A requests
access to page P ;
gets private copy.
Jim Gray, Andreas Reuter
b) Access module in
process B requests
access to page P ;
gets private copy.
c) Both processes try to rewrite
an updated version of the page,
but these versions are different.
Only the version written last
will be on disk; this is the "lost
update" anomaly.
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
23
What the Buffer Manager Does for
Synchronization
Sharing: Pages are made addressable to all
processes that run the database code.
Semaphore protection: Each requestor gets
the address of a semaphor protecting the page.
Durable storage: The access modules inform
the buffer manager if their page access has
resulted in an update of the page; the actual
write operation, however, is issued by the buffer
manager, probably at a time when the update
transaction is long gone.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
24
The Interface to the Buffer Manager
typedef struct
{PAGEID
pageid;
/* id of page in file
*/
PAGEPTR
pageaddr;
/* base addr. in buffer */
int
index;
/* record within page */
semaphore *
pagesem;
/* pointer to the sem. */
Boolean
modified;
/* caller modif. page */
Boolean
invalid;
/* destroyed page
*/
}
BUFFER_ACC_CB, *BUFFER_ACC_CBP;
/* control block for buffer access */
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
25
The Need for Fix and Unfix
transaction X
transaction Y
transaction Y
bufferfix(Q,...)
bufferfix(Q,...)
bufferfix(P,...)
return base
address in
buffer
page P
not
in buffer
read block
containing
page P
a) Transaction X
requests access
to page P; gets
base address in buffer.
Jim Gray, Andreas Reuter
return base
address in
buffer
page Q not
in buffer;
replace P
b) Transaction Y
requests access
to page Q; buffer
mgr. decides to
replace page P
Transaction Processing - Concepts and Techniques
read block
containing
page Q
c) Transaction Y
gets the base
address of Q in
the buffer - is
same as P's.
WICS August 2 - 6, 1999
26
The Fix-Use-Unfix Protocol I
FIX: The client requests access to a page
using the bufferfix interface.
USE: The client uses the page and the pointer
to the frame containing the page will remain
valid.
UNFIX: The client explicitly waives further
usage of the frame pointer; that is, it tells the
buffer manager that it no longer wants to use
that page.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
27
The Fix-Use-Unfix Protocol II
page P
use
page R
fix page P
use
use
unfix page P
fix page R
page Q
use
use
unfix page R
use
unfix page Q
fix page Q
use
use
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
28
Structure of the Buffer Manager
bufferpool
mru_page
lru_page
hash table
p
a
g
e
s
frame_index
buffer
control
block
buffer
control
block
buffer
control
block
buffer
control
block
buffer
control
block
first_bcb
next_in_hclass
buffer
control
block
frames
buffer
access
control
block
to and from client
index of frame holding the page (address pointer in
case of buffer access control block)
chain of buffer control blocks in same hash class
LRU - chain
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
29
Logging and Recovery from the
Buffer Manager's Perspective I
Transaction
Buffer
running
running
A
B
A
B
OK; old state in DB
A
B
A
B
OK; old state in DB
running
A
B
A
B
database corrupted
running
A
B
A
B
conflicting view on TA
committed
committed
A
B
A
B
OK; Read-only TA
A
B
A
B
DB not in new state
committed
A
B
A
B
committed
A
B
A
B
database corrupted
OK; new state in DB
Jim Gray, Andreas Reuter
Database
Remark
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
30
Logging and Recovery from the
Buffer Manager's Perspective II
state of
transaction
TA
state of
page A in
database
result of recovery
using operation log
aborted
old
wrong tuple might be deleted
aborted
new
inverse operation succeeds
committed
old
operation succeeds
committed
new
duplicate of tuple is inserted
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
31
The Log and Page LSNs
page A LSN1
page A LSN3
write to disk
write to disk
buffer
manager
log
manager log record for
page A
LSN1
log record for
page A
LSN2
log record for
page A
LSN3
log record for
page A
LSN4
time
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
32
Different Buffer Management Policies
Steal policy: When the buffer manager needs
space, it can decide to replace dirty pages.
No-Steal policy: Pages can be replaced only if
they are clean.
Force policy: At end of transaction, all
modified pages are forced to disk in a series of
synchronous write operations.
No-Force policy: No modified page is forced
during commit. REDO log records are written to
the log.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
33
The Problem of Hotspot Pages
durable storage
force
page A
bufferpool
update
operations
TA1
TA2
TA3
TA4
TA5
TA6
TA7
TA8
log
The dotted arrows indicate an update of the page by the respective
transaction.The arrows at 45 degrees indicate the forced writing of the page
during commit processing.The downward arrows indicate the writing of log
records for the respective transaction.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
34
The Basic Checkpoint Algorithm
Quiesce: Delay all incoming update DML calls
until all fixes with exclusive semaphores have
been released.
Flush the buffer: Write all modified pages.
Log the checkpoint: Write a record to the log,
saying that a checkpoint has been generated.
Resume normal operation: The bufferfix requests
for updates that have been delayed in order to
take the checkpoint can now be processed again.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
35
The Case for Indirect Checkpointing
checkpoints
c1
c2
c3
1
2
3
4
5
6
7
8
9
10
framenumbers
log
When taking a checkpoint, the PAGEIDs of the pages
currently in buffer are written to the log.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
36
The Indirect Checkpointing Algorithm
Record TOC: Log the list of PAGEIDs.
Compare with prev. ckpt: See if any modified
pages have not been replaced since last ckpt.
Force lazy pages: Schedule the writing of those
pages during the next checkpoint interval.
Low-water mark: Find the LSN of the oldest
still-volatile update; write it to the log.
Write “Checkpoint done” record
Resume normal operation
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
37
Further Possibilities for Optimization
Pre-flushing can be performed by an
asynchronous process that scans the buffer for
"old" modified pages. Writing is done under
semaphore protection.
Pre-fetching can, among other things, be used
to make restart more efficient. If page reads are
logged one can use the recent checkpoint plus
the log to prime the bufferpool, i.e. it will look
almost exactly like at the moment of the crash.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
38
Further Possibilities for Optimization
Transaction scheduling and buffer management
can take hints from the query optimizer:
This relation will be scanned sequentially.
This is a sequential scan of the leaves of a Btree.
This is the traversal of a B-tree, starting at the
root.
This is a nested-loop join, where the inner
relation is scanned in physically sequential
order.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999
39