Concurrent Cache-Oblivious B-trees using Transactional Memory
Concurrent Cache-Oblivious B-trees Using Transactional Memory
Jim Sukha
Bradley Kuszmaul
MIT CSAIL
June 10, 2006
Thought Experiment
Imagine that, one day, you are assigned the
following task:
Enclosed is code for a serial, cache-oblivious B-tree. We want a reasonably
efficient parallel implementation that works for disk-resident data.
Attach:
COB-tree.tar.gz
PS. We want to be able to restore the data
to a consistent state after a crash too.
PPS. Our deadline is next week. Good luck!
Concurrent COB-tree?
Question:
How can one program a concurrent,
cache-oblivious B-tree?
Approach:
We employ transactional memory. What
complications does I/O introduce?
Potential Pitfalls Involving I/O
Suppose our data structure resides on disk.
1. We might need to make explicit I/O calls to
transfer blocks between memory and disk.
But a cache-oblivious algorithm doesn’t know
the block size B!
2. We might need buffer management code if the
data doesn’t fit into main memory.
3. We might need to unroll I/O if we abort a
transaction that has already written to disk.
Our Solution: Libxac
• We have implemented Libxac, a page-based transactional memory system that
operates on disk-resident data. Libxac supports ACID transactions on a
memory-mapped file.
• Using Libxac, we are able to implement a
complex data structure that operates on
disk-resident data, e.g. a cache-oblivious
B-tree.
Libxac Handles Transaction I/O
1. We might need to make explicit I/O calls to transfer blocks between memory and disk.
   Similar to mmap, Libxac provides a function xMmap. Thus, we can operate on
   disk-resident data without knowing the block size.
2. We might need buffer management code if the data doesn't fit into main memory.
   Like mmap, the OS automatically buffers pages in memory.
3. We might need to unroll I/O if we abort a transaction that has already written to disk.
   Since Libxac implements multiversion concurrency control, we still have the
   original version of a page even if a transaction aborts.
Outline
• Programming with Libxac
• Cache-Oblivious B-trees
Example Program with Libxac
int main(void) {
  int* x;
  int status = FAILURE;

  // Runtime initialization. For durable transactions,
  // logs are stored in the specified directory.*
  xInit("/logs", DURABLE);

  // Transactionally map the first page of the input file.
  x = xMmap("input.db", 4096);

  // Transaction body. The body can be a complex function
  // (e.g., a cache-oblivious B-tree insert!).
  while (status != SUCCESS) {
    xbegin();
    x[0]++;
    status = xend();
  }

  xMunmap(x);     // Unmap the region.
  xShutdown();    // Shut down the runtime.
  return 0;
}

* Currently Libxac logs transaction commits, but we haven't implemented the recovery program yet.
Libxac Memory Model
int main(void) {
  int* x;
  int status = FAILURE;
  xInit("/logs", DURABLE);
  x = xMmap("input.db", 4096);
  while (status != SUCCESS) {
    xbegin();
    x[0]++;
    status = xend();
  }
  xMunmap(x);
  xShutdown();
  return 0;
}

*Libxac supports concurrent transactions on multiple processes, not threads.

1. Aborted transactions are visible to the programmer (thus, the programmer must
   explicitly retry the transaction). Control flow always proceeds from xbegin()
   to xend(). Thus, the transaction body can contain system/library calls.
2. At xend(), all changes to the xMmap'ed region are discarded on FAILURE, or
   committed on SUCCESS.
3. Aborted transactions always see consistent state. Read-only transactions can
   always succeed.
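For example, a minimal sketch of our own (not from the talk; the file name is an assumption) of a transaction that only reads the mapped region, illustrating rule 3: it sees a consistent snapshot even while other processes write concurrently, so its xend() can always return SUCCESS.

int read_first_int(void) {
  int* x = xMmap("input.db", 4096);
  int v = 0;
  int status = FAILURE;
  while (status != SUCCESS) {   // retry loop kept, as rule 1 requires
    xbegin();
    v = x[0];                   // read-only body: always sees consistent state
    status = xend();
  }
  xMunmap(x);
  return v;
}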
Implementation Sketch
• Libxac detects memory accesses by using a
SIGSEGV handler to catch a memory protection
violation on a page that has been mmap’ed.
• This mechanism is slow for normal transactions:
  – Time for mmap, SIGSEGV handler: ~10 ms
• Efficient if we must perform disk I/O to log transaction commits.
  – Time to access disk: ~10 ms
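A minimal sketch (our own illustration, not Libxac source; the file name and sizes are assumptions) of this detection mechanism: map a file PROT_NONE and use a SIGSEGV handler with mprotect to record and then permit the first access to a page. Libxac distinguishes read faults from write faults; this sketch simply grants both at once.

#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

static char *region;   /* start of the protected mapping */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  char *fault = (char *)info->si_addr;
  if (fault < region || fault >= region + REGION_SIZE)
    _exit(1);          /* a real segfault outside our region */
  /* Here a TM system would record the page in the transaction's
     read/write set; we just unprotect it so the access can proceed. */
  uintptr_t page = (uintptr_t)fault & ~(uintptr_t)(getpagesize() - 1);
  mprotect((void *)page, getpagesize(), PROT_READ | PROT_WRITE);
}

int main(void) {
  int fd = open("input.db", O_RDWR | O_CREAT, 0644);
  if (fd < 0) return 1;
  ftruncate(fd, REGION_SIZE);
  region = mmap(NULL, REGION_SIZE, PROT_NONE, MAP_SHARED, fd, 0);
  if (region == MAP_FAILED) return 1;

  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  region[0]++;                       /* faults once, then proceeds */
  printf("first byte is now %d\n", region[0]);

  munmap(region, REGION_SIZE);
  close(fd);
  return 0;
}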
Is xMmap practical?
Experiment on a 4-processor AMD Opteron, performing 100,000 insertions of elements with random keys into a B-tree. Each insert is a separate transaction. Libxac and BDB both implement group commit.

The B-tree and COB-tree both use Libxac. Note that none of the three data structures have been properly tuned.

Conclusion: we should be able to achieve good performance.
Outline
• Programming with Libxac
• Cache-Oblivious B-trees
What is a Cache-Oblivious B-tree?
• A cache-oblivious B-tree (e.g. [BDFC00]) is a dynamic dictionary data structure
  that supports searches, insertions/deletions, and range queries.
• A cache-oblivious algorithm/data structure does not know system parameters
  (e.g. the block size B).
• Theorem [FLPR99]: a cache-oblivious algorithm that is optimal for a two-level
  memory hierarchy is also optimal for a multi-level hierarchy.
Cache-Oblivious B-Tree Example
[Figure: a static cache-oblivious binary tree (internal keys 21; 10, 45; 4, 16, 38, 54; leaves hold the maximum key of each PMA section: 4 10 16 21 38 45 54 83) indexing a packed memory array whose cells are:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83]

The COB-tree can be divided into two pieces:
1. A packed memory array (PMA) that stores the data in order, but contains gaps.
2. A static cache-oblivious binary tree that indexes the packed memory array.
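A minimal sketch of these two pieces as C structures (the names and fields are our own assumptions, not the talk's code):

#define EMPTY_SLOT (-1)

/* Piece 1: a packed memory array -- keys stored in order, with gaps. */
typedef struct {
  int *slots;       /* capacity entries; EMPTY_SLOT marks a gap        */
  int  capacity;    /* total number of slots                           */
  int  count;       /* number of occupied slots                        */
} pma_t;

/* Piece 2: a static cache-oblivious binary tree stored in an array.
   Each node summarizes the keys of the PMA region beneath it.        */
typedef struct {
  int *keys;        /* tree nodes in van Emde Boas layout order        */
  int  nnodes;
} static_index_t;

typedef struct {
  pma_t          pma;
  static_index_t index;
} cob_tree_t;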
Cache-Oblivious B-Tree Insert

To insert a key of 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step may require a rebalance of the PMA.
3. Fix the static tree.

[Figure sequence: 37 belongs in the PMA section holding 23 24 31 38. After the insert and rebalance, the right half of the PMA reads 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --, and the static-tree keys covering those sections change from 38, 45, 54 to 37, 40, 56.]

Insert is a complex operation. If we wanted to use locks, what is the locking protocol? What is the right (cache-oblivious?) lock granularity?
Conclusions
A page-based TM system such as Libxac
• Represents a good match for disk-resident
data structures.
– The per-page overheads of TM are small compared to the cost of I/O.
• Is easy to program with.
– Libxac allows us to program a concurrent, disk-resident data structure with
  ACID properties, as though it were stored in memory.
Semantics of Local Variables
int main(void) {
  int y = 0, z = 0, a = 0, b = 0;
  int* x;
  int status = FAILURE;
  xInit("/logs", DURABLE);
  x = xMmap("input.db", 4096);
  while (status != SUCCESS) {
    a++;
    xbegin();
    b = x[0];
    y++;
    x[0]++;
    z = x[0] - 1;
    status = xend();
  }
  xMunmap(x);
  xShutdown();
  return 0;
}

In this system, Libxac guarantees that after the loop completes, a == y. The value of a is the number of times the transaction was attempted.

We always have b == z, because aborted transactions always see consistent state, even if other programs concurrently access the first page of input.db.
TM System Improvements?
Possible improvements to Libxac:
– Provide more efficient support for non-durable transactions by modifying the
  OS to track and report the pages accessed?
– Integrate Libxac with another TM system to provide concurrency control for
  both multiple threads and multiple processes?
Implementation Sketch

The example transaction is:

  int a;
  xbegin();
  a = x[0];
  x[1024] += a+1;
  xend();

[Figure sequence: the memory map of input.txt holds four pages (containing 1, 7, 2, and 9), all protected PROT_NONE when the transaction begins. The read of x[0] triggers a segmentation fault, and the handler upgrades that page to PROT_READ; reading x[1024] does the same for its page. The write to x[1024] triggers a second fault on that page: its original contents are copied to the buffer file, and the page is upgraded to PROT_READ|PROT_WRITE before the write proceeds. At xend(), the modified page is written to the log on disk, the committed contents are copied out, and the pages return to PROT_NONE for the next transaction.]
Focus on PMA Rebalance
insert(tree, key, value) {
  xbegin();
  x = find_location_in_pma(tree->static_index, key);
  insert_into_pma(tree->pma, key, value, x);   // may call rebalance(tree->pma)
  fix_static_index(tree->static_index);
  xend();
}

In this talk, we illustrate the problems of transaction I/O by considering a transactional rebalance of the packed memory array.
Rebalance of an In-Memory Array
#include <math.h>

#define EMPTY_SLOT (-1)

void rebalance(int* x, int n) {
  int i;
  int count = 0;

  // Slide everything left:
  //   0 -- 1 2 3 -- -- 4 -- 5 6 --   becomes   0 1 2 3 4 5 6 ...
  for (i = 0; i < n; i++) {
    if (x[i] != EMPTY_SLOT) {
      x[count] = x[i];
      count++;
    }
  }

  // Redistribute items from the right, leaving evenly spaced gaps:
  //   ... becomes   0 1 -- 2 -- 3 4 -- 5 -- 6 --
  int j = count - 1;
  double spacing = 1.0 * n / count;
  for (i = n - 1; i >= 0; i--) {
    if (floor(j * spacing) == i) {
      x[i] = x[j];
      j--;
    } else {
      x[i] = EMPTY_SLOT;
    }
  }
}
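A minimal driver for the code above (our own addition; the array contents come from the slide, and printing gaps as "--" is an assumption):

#include <stdio.h>

int main(void) {
  int x[] = {0, EMPTY_SLOT, 1, 2, 3, EMPTY_SLOT, EMPTY_SLOT, 4,
             EMPTY_SLOT, 5, 6, EMPTY_SLOT};
  int n = sizeof(x) / sizeof(x[0]);
  rebalance(x, n);
  for (int i = 0; i < n; i++) {
    if (x[i] == EMPTY_SLOT) printf("-- ");
    else printf("%d ", x[i]);
  }
  printf("\n");   // prints: 0 1 -- 2 -- 3 4 -- 5 -- 6 --
  return 0;
}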
Rebalance with Explicit I/O
Why do we want to avoid performing explicit I/O to read/write data blocks?

void rebalance_with_IO(int n) {
  int i;
  int count = 0;
  int* y; int* z;
  y = read_block(0);
  z = read_block(0);
  for (i = 0; i < n; i++) {
    if (i % B == 0) {
      y = read_block(i/B);
    }
    if (y[i%B] != EMPTY_SLOT) {
      if (count % B == 0) {
        z = read_block(count/B);
      }
      z[count%B] = y[i%B];
      count++;
    }
  }
  ...
  write_block(…)
}

Issues:
1. What if the data does not fit into memory? We must have buffer management code somewhere.
2. A cache-oblivious algorithm does not know the value of B!
Rebalance using Memory Mapping
void rebalance_with_mmap(int n) {
  int i;
  int count = 0;
  x = mmap("input.db", n*sizeof(int));
  for (i = 0; i < n; i++) {
    if (x[i] != EMPTY_SLOT) {
      x[count] = x[i];
      count++;
    }
  }
  ...
  munmap(x, n*sizeof(int));
}

Using mmap, the code looks like the in-memory rebalance. But we still need concurrency control.

1. What if the data does not fit into memory? If we use memory mapping, then the OS automatically buffers pages that are accessed. I/O is transparent to the user.
2. What value of B do we choose for a cache-oblivious algorithm? B does not appear in the application code.
Concurrent Rebalance?
What happens if we want the rebalance to occur concurrently?

void rebalance_with_mmap(int n) {
  int i;
  int count = 0;
  x = mmap("input.db", n*sizeof(int));
  for (i = 0; i < n; i++) {
    if (x[i] != EMPTY_SLOT) {
      x[count] = x[i];
      count++;
    }
  }
  ...
  munmap(x, n*sizeof(int));
}

If we use locks, what do we choose as the locking granularity?

If we use transactions, will the system need to unroll I/O when a transaction aborts, to ensure the data on disk is consistent?
Transactional Memory Mapping
Our solution: replace mmap with xMmap, and use transactions for concurrency control. The transaction system maintains multiple versions of a page to avoid needing to unroll I/O.

void rebalance_with_xMmap(int n) {
  int i;
  int count = 0;
  x = xMmap("input.db", n*sizeof(int));
  xbegin();
  for (i = 0; i < n; i++) {
    if (x[i] != EMPTY_SLOT) {
      x[count] = x[i];
      count++;
    }
  }
  ...
  xend();
  xMunmap(x, n*sizeof(int));
}

Transactional memory mapping simplifies the code for a concurrent disk-resident data structure.
Cache-Oblivious B-Tree Insert
2(a) Add 37 to the packed memory array: the PMA cells 23 24 31 38 | 39 40 -- 45 become 23 24 31 37 | 38 39 40 45.
2(b) Rebalance the PMA: the right half of the array becomes 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --.

[Figure annotations show the packed memory array density thresholds: a window of 8 cells must hold between 2.5 and 7.5 elements, and a window of 16 cells between 6 and 14 elements.]
Dictionary Operations using a B+-Tree
[Figure: a B+-tree whose root holds keys 4 10 20 42 and whose leaves hold 1 3 4 | 5 8 10 | 12 15 20 | 21 33 35 41 42 | 51 70 77 85.]

The branching factor of the tree and the size of a block on disk are both Θ(B). For a B+-tree, the data is stored at the leaves. The keys at an interior node represent the maximum key of that node's subtree.

But to build the tree, we need to know the value of B…
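To make "branching factor Θ(B)" concrete, here is a minimal sketch of our own (not from the talk; the block size is an assumption) of a B+-tree node sized to fill one disk block:

/* B is the disk block size in bytes; the value here is an assumption. */
#define B 4096
/* A node fills one block, so the fanout is Theta(B). */
#define FANOUT ((B - sizeof(int)) / (sizeof(int) + sizeof(void *)))

typedef struct bplus_node {
  int   nkeys;                 /* number of keys currently stored            */
  int   keys[FANOUT];          /* keys[i] = max key in children[i]'s subtree */
  void *children[FANOUT];      /* child nodes, or data records at a leaf     */
} bplus_node;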
Cache-Oblivious B-tree [BDFC00]
Operations                 Cost in Block Transfers
Search(key)                O(log_B N)
Insert(key, value)         O(log_B N)**
Delete(key, value)         O(log_B N)**
RangeQuery(start, end)     O(log_B N + k/B)*

* Bound assumes the range query finds k items with keys between start and end.
** Amortized bound.

It is possible to support dictionary operations cache-obliviously, i.e., with a data structure that does not know the value of B. The cache-oblivious B-tree (COB-tree) achieves the same asymptotic (amortized) bounds as a B+-tree.
Cache-Oblivious B-Tree Overview
[Figure: a static cache-oblivious binary tree sitting on top of a packed memory array.]

The static tree is used as an index into a packed memory array. To perform an insert, insert into the packed memory array and update the static tree. When the packed memory array becomes too full (or too empty), rebuild and grow (or shrink) the entire data structure.
Static Cache-Oblivious Binary Tree
[Figure: a tree of N nodes is split into a top subtree of size Θ(N^(1/2)) and Θ(N^(1/2)) bottom subtrees, each of size Θ(N^(1/2)); each of those pieces is in turn split into subtrees of size Θ(N^(1/4)), and so on recursively.]

Divide the tree into Θ(N^(1/2)) subtrees of size Θ(N^(1/2)). Lay out each subtree contiguously in memory, recursively.
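A minimal sketch of our own (not [BDFC00]'s code) of this recursive van Emde Boas layout. Nodes of a perfect binary tree are identified by their heap index (root = 1, children of i are 2i and 2i+1); the function appends heap indices to "out" in the order the nodes would be stored contiguously in memory.

#include <stdio.h>
#include <stdlib.h>

static void veb_layout(int root, int height, int *out, int *pos) {
  if (height == 1) {                  /* a single node: emit it */
    out[(*pos)++] = root;
    return;
  }
  int top_h = height / 2;             /* split the levels roughly in half */
  int bot_h = height - top_h;
  veb_layout(root, top_h, out, pos);  /* lay out the top subtree first */
  int num_bottom = 1 << top_h;        /* 2^top_h bottom subtrees hang below it */
  for (int i = 0; i < num_bottom; i++) {
    int bottom_root = root * num_bottom + i;   /* node at depth top_h below root */
    veb_layout(bottom_root, bot_h, out, pos);
  }
}

int main(void) {
  int height = 4;                     /* a 15-node tree */
  int n = (1 << height) - 1;
  int *out = malloc(n * sizeof(int));
  int pos = 0;
  veb_layout(1, height, out, &pos);
  for (int i = 0; i < n; i++)
    printf("%d ", out[i]);            /* prints: 1 2 3 4 8 9 5 10 11 6 12 13 7 14 15 */
  printf("\n");
  free(out);
  return 0;
}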
Packed Memory Array
A packed memory array uses a contiguous section of memory to store elements in order, but with gaps. For sections of size 2^k, the gaps are spaced to maintain specified density thresholds that become arithmetically stricter as k increases.

[Figure: the packed memory array 1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83, with nested windows annotated with the density thresholds [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], and [7/16, 13/16].]

Density thresholds example:
  Sections of size 2^4: density between 4/16 and 16/16.
  Sections of size 2^5: density between 5/16 and 15/16.
  Sections of size 2^6: density between 6/16 and 14/16.
  Sections of size 2^7: density between 7/16 and 13/16.
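A minimal sketch of our own, matching the example numbers above (EMPTY_SLOT and the function name are assumptions), of checking whether a window of size 2^k satisfies its density thresholds:

#include <stdbool.h>

#define EMPTY_SLOT (-1)

/* For the example above (windows of size 2^4 .. 2^7), a window of size 2^k
   must keep its density within [k/16, (20-k)/16]. */
static bool window_within_thresholds(const int *pma, int start, int k) {
  int size = 1 << k;
  int count = 0;
  for (int i = 0; i < size; i++)
    if (pma[start + i] != EMPTY_SLOT)
      count++;
  double density = (double)count / size;
  double min_density = k / 16.0;
  double max_density = (20 - k) / 16.0;
  return density >= min_density && density <= max_density;
}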
Example Cache-Oblivious B-Tree
[Figure: the static cache-oblivious tree and packed memory array from before (cells 1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83), with the density thresholds [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], and [7/16, 13/16] marked on nested windows.]
Cache-Oblivious B-Tree Insert
Insert 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.

[Figure: the same tree and PMA as above; 37 belongs in the section holding 23 24 31 38.]
Example Cache-Oblivious B-Tree
2(a) Add 37 to the packed memory array: the cells 23 24 31 38 | 39 40 -- 45 become 23 24 31 37 | 38 39 40 45.
2(b) Rebalance the PMA: the right half of the array becomes 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --.

[Density thresholds marked in the figure: [5/16, 15/16] and [6/16, 14/16].]
Cache-Oblivious B-Tree Insert
Insert 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.
3. Fix the static tree.

[Figure: the final state. The PMA now reads 1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --, and the static-tree keys covering the rebalanced sections are now 37, 40, and 56.]