A Locality-Preserving Cache-Oblivious Dynamic Dictionary


A Locality-Preserving Cache-Oblivious Dynamic Dictionary
By
Michael A. Bender, Ziyang Duan, John Iacono, Jing Wu
Presented By
Gregory Giguashvili
[email protected]
Abstract
• A simple dictionary designed for
hierarchical memory layouts
• Cache-oblivious DS has memory
performance optimized for all levels of
memory hierarchy – no parameterization
• Locality-preserving dictionary maintains
elements of similar key values stored close
together
20 July 2015
Presented by Gregory Giguashvili
2
Abstract
• Presents a simplification of the cache-oblivious
B-tree of Bender, Demaine and Farach-Colton
• Given N elements and B block size
– Search operations: O(log_B N)
– Insert and delete operations: O(log_B N + log² N / B) amortized
– Scan of k data items: O(log_B N + k / B)
• Implementation and evaluation of the DS is
presented on a simulated memory hierarchy
Memory Models
• Two-level memory model
– Disk Access Model (DAM) consists of internal
memory of size M and external disk partitioned into
B-sized blocks
– Tied to a particular memory model, thus neither portable
nor flexible
• Cache-oblivious memory model
– Reason about two-level, but prove results for
unknown multilevel memory models
– Parameters B and M are unknown, thus optimized for
all levels of memory hierarchy
B-trees
• B-tree is the classic external-memory
search tree.
• B-tree supports insert, delete, search and
scan on two level memory hierarchy
(memory and disk)
• Balanced tree with fan-out proportional to
disk-block size B
B-trees Formally
• B-tree is a rooted tree
• Every node x has the following fields
• n[x] – number of keys stored in x
• The keys sorted in non-decreasing order
• Leaf[x] – Boolean value: TRUE if x is a leaf
• Internal nodes contain pointers to children
• The keys separate the ranges of keys stored
in each sub-tree
B-trees Formally
• Every leaf has the same depth (tree’s height)
• Lower and upper bounds on the number of keys per
node (minimum degree t ≥ 2)
– Every node (other than root) must have at least t-1
keys thus t children
– Every node can contain at most 2t-1 keys thus 2t
children
• Simplest B-tree occurs when t=2 (2-3-4 tree)
• In practice, much larger values of t are used
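The node layout and search described above can be sketched in Python. This is an illustrative sketch, not the paper's implementation; the names `BTreeNode` and `btree_search` are my own:

```python
class BTreeNode:
    """A B-tree node: n[x] sorted keys and, for internal nodes, n+1 children."""
    def __init__(self, keys, children=None):
        self.keys = keys                  # keys in non-decreasing order
        self.children = children or []    # empty list means x is a leaf
        self.leaf = not self.children

def btree_search(x, key):
    """Return (node, index) holding `key`, or None if absent."""
    i = 0
    while i < len(x.keys) and key > x.keys[i]:
        i += 1                            # find the sub-tree range for `key`
    if i < len(x.keys) and x.keys[i] == key:
        return x, i
    if x.leaf:
        return None
    return btree_search(x.children[i], key)
```

On the slides' example tree (root M, internal children D H and Q T X), `btree_search(root, "R")` descends M → Q T X → the leaf R S.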
B-trees Example
Search for R

[Figure: a B-tree with root M; internal children D H and Q T X; leaves B C, F G, J K L, N P, R S, V W, Y Z. The search for R descends M → Q T X → R S.]
B-trees
• Given a B-tree with N nodes and branching
factor B
– Tree search: Θ(log_B N)
– Insert or delete a key: O(h)

[Figure: splitting a full node (t=4): parent N W with child P Q R S T U V becomes parent N S W with children P Q R and T U V.]
B-trees
• Depend critically on block-size B thus
optimized only for two levels of memory
• Theoretically, multilevel B-tree can be
created, but it is very complex
• Improving data layout can optimize B-tree
structures (place logically close data
physically near each other)
• Bad performance for sub-optimal B values
Theory of COM
• Based on ideal-cache model of Frigo, Leiserson,
Prokop and Ramachandran
– Fully associative cache of size C·B
– Disk partitioned into B-sized memory blocks
– A block is fetched from disk to cache on a cache miss
– The ideal memory block is selected for replacement if
the cache is full
• Proved to work with a small constant overhead
• Used by cache-oblivious algorithms to analyze
memory transfers between cache levels
Data Structure
• Data array
– Data and linear number of blank entries
• Index array
– Encoding of a tree that indexes the data
• Search and update operations involve basic
manipulations on these arrays
• Relatively simple to implement
Related Work
• Cache-oblivious search tree by Brodal, Fagerberg
and Jacob (same complexity)
– Balanced tree of height logN + O(1)
– Layout in array of size O(N)
• Non-data-structure problems
– Matrix multiplication, transpose, LU decomposition,
FFT, sorting
• Keeping N elements ordered in O(N) memory, by
Itai, Konheim and Rodeh (in the context of priority queues)
Data Structure Operations
• Dynamic set S with keys from totally ordered
universe
– Insert(x): adds x to S
– Delete(x): removes x from S
– Predecessor(x): gets the item from S that has the largest
key value at most x
– ScanForward(): gets successor of the most recently
accessed item in S
– ScanBackward(): gets predecessor of the most
recently accessed item in S
Data Structure Description
• Data array is a packed-memory structure
– Insertions and deletions require: O(log² N / B + 1)
– A scan of k elements requires: O(k / B + 1)
• Index array is a static tree structure
– Each leaf corresponds to an item in data array
– Updating the index tree does not dominate
data insertion cost
Packed-Memory Maintenance
• N totally ordered elements stored in sorted order in an
array of size O(N)
– Element x precedes y iff x < y
– Any set of k contiguous elements is stored in a
contiguous sub-array of size O(k)
• Scan any set of k contiguous elements uses
O(k/B + 1) memory transfers
• Insert or delete an element uses O(log² N / B + 1)
amortized memory transfers
Packed-Memory Maintenance
• When a window of the array becomes too
full or empty, the elements are spread
evenly within a larger window
– Window sizes range from O(logN) to O(N)
and are powers of two
– Window sizes are associated with density
thresholds (a lower bound LT and an upper bound UT)
– When deviation from threshold is discovered,
the densities of the windows are adjusted
Packed-Memory Maintenance
• Rebalancing
– A window of size 2^k is overflowing if the
number of data elements in the region exceeds
2^k · UT
– A window of size 2^k is underflowing if the
number of data elements in the region is less
than 2^k · LT
• Take elements of the window and
rearrange them to be spread out evenly
Packed-Memory Maintenance
• Insertion of x at location A[j]
– Examine the windows containing A[j] of size 2^k for
k = log log N, …, log N
– Take the first window that is not overflowing
– Insert the element x and rebalance this window
• Deletion of x from location A[j]
– Examine the windows containing A[j] of size 2^k for
k = log log N, …, log N
– Take the first window that is not underflowing
– Delete the element x and rebalance this window
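The window-growing insertion above can be sketched as follows. This is a minimal sketch, assuming power-of-two array sizes and the illustrative thresholds LT = 0.25 and UT = 0.75 (the slides do not fix concrete values); names like `pm_insert` and `rebalance` are mine:

```python
BLANK = None
LT, UT = 0.25, 0.75          # illustrative lower/upper density thresholds

def window_vals(arr, lo, hi):
    """Non-blank elements of arr[lo:hi], in order."""
    return [v for v in arr[lo:hi] if v is not BLANK]

def rebalance(arr, lo, hi, vals):
    """Spread vals evenly across arr[lo:hi], preserving order."""
    arr[lo:hi] = [BLANK] * (hi - lo)
    step = (hi - lo) / len(vals)
    for t, v in enumerate(vals):
        arr[lo + int(t * step)] = v

def pm_insert(arr, j, x):
    """Insert x near position j: examine enclosing power-of-two windows
    from small to large, take the first one that would not overflow,
    and rebalance it with x included."""
    n, size = len(arr), 2
    while size <= n:
        lo = (j // size) * size
        vals = window_vals(arr, lo, lo + size)
        if (len(vals) + 1) / size <= UT:   # window not overflowing
            vals.append(x)
            vals.sort()                    # x lands next to its predecessor
            rebalance(arr, lo, lo + size, vals)
            return
        size *= 2
    raise RuntimeError("array too dense; grow to O(N) size and recopy")
```

Deletion is symmetric, rebalancing the first window that is not underflowing with respect to LT.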
Static Tree Maintenance
• Cache-oblivious static tree structure due to
Prokop
• Given a complete binary tree, describe
mapping from the nodes to positions of an
array (van Emde Boas layout)
– Traverse from root to a leaf in an N-node tree
in Θ(log_B N) memory transfers (optimal)
Static Tree Maintenance
• Given a tree with N items and h = log(N+1)
– Split the tree at its middle height
– A top recursive sub-tree A of height h/2
– Bottom recursive sub-trees B1, …, Bk of height h/2
– k = O(√N) and all sub-trees contain O(√N) nodes
• Partition the array into regions, one for each of
A, B1, …, Bk
• Recursively apply this mapping
• Stop when the trees have one node
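The recursive mapping can be sketched as a short Python function. This is an illustrative sketch (the function name and the BFS node numbering are my own conventions): node i of a complete binary tree has children 2i and 2i+1, and the function lists the nodes in van Emde Boas layout order.

```python
def veb_order(root=1, height=4):
    """BFS node numbers of a complete binary tree of the given height,
    listed in van Emde Boas layout order: top sub-tree first, then each
    bottom sub-tree recursively, left to right."""
    if height == 1:
        return [root]
    top_h = height // 2                # height of the top sub-tree A
    bot_h = height - top_h             # height of the bottom sub-trees
    order = veb_order(root, top_h)
    # leaves of the top sub-tree, left to right, in BFS numbering
    leaves = range(root << (top_h - 1), (root + 1) << (top_h - 1))
    for leaf in leaves:
        order += veb_order(2 * leaf, bot_h)      # left bottom sub-tree
        order += veb_order(2 * leaf + 1, bot_h)  # right bottom sub-tree
    return order
```

For a 15-node tree (height 4) this yields the top triple followed by four bottom triples, matching the layout shown on the next slide.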
Static Tree Maintenance
[Figure: van Emde Boas layout of a 15-node complete binary tree. The top sub-tree A occupies array positions 1–3; the bottom sub-trees B1, B2, B3, B4 occupy positions 4–6, 7–9, 10–12, 13–15. A plain array stores positions 1 2 3 … 15 contiguously.]
Cache-Oblivious Structure
• “Array” refers to the dynamic packed-memory structure
storing the data
• “Tree” refers to the static cache-oblivious tree that serves
as an index
• N data items are stored in “array” of size O(N) (sorted,
some blank entries)
• There are pointers between array position A[i] and leaf Ti
for all values of i (A[i] and Ti store the same key value)
• Internal nodes of T store the maximum of non-blank key
values of its children
Predecessor Operation
• Traverse the tree from root to a leaf
• Similar to standard predecessor search in
binary tree
• Reaching node u, we decide where to
branch by comparing the search key to the
key of u’s left child
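The branching rule above can be sketched in Python. A minimal sketch, assuming distinct keys and no blank leaves (blanks are a detail the real structure handles); `build_index`, `Inner`, and `Leaf` are illustrative names:

```python
class Leaf:
    """Mirrors one slot of the data array (blanks omitted for simplicity)."""
    def __init__(self, key):
        self.key = key
        self.prev = None               # previous leaf in array order

class Inner:
    """Internal index node: stores the maximum key of its sub-tree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.key = right.key           # keys are sorted, so right holds the max

def build_index(keys):
    """Bottom-up construction over a power-of-two number of sorted keys."""
    leaves = [Leaf(k) for k in keys]
    for a, b in zip(leaves, leaves[1:]):
        b.prev = a
    level = leaves
    while len(level) > 1:
        level = [Inner(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def predecessor(root, x):
    """Root-to-leaf walk: branch right iff x exceeds the left child's max.
    If the leaf reached overshoots x, step back one leaf."""
    u = root
    while isinstance(u, Inner):
        u = u.right if x > u.left.key else u.left
    if u.key <= x:
        return u.key
    return u.prev.key if u.prev else None
```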
Predecessor Operation
Theorem 1. The predecessor operation uses
O(log_B N) block transfers.

Proof: A search in our structure is similar to a root-to-leaf
traversal in the "tree". Examining the key value of the
current node's left child at every step may cause an
additional O(log_B N) block transfers.
Scan Forward/Backward Operation
• Scan forward (backward) to the next non-blank item
in the "array" from the last item accessed
• Because of the density constraint of the "array", only
O(1) elements are scanned to find the next
non-blank element
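As a sketch, both scans are a short skip over blank slots (illustrative function names; blanks are represented as `None` as in the packed-memory sketch):

```python
def scan_forward(arr, last):
    """Index of the next non-blank entry after `last`; the packed-memory
    density invariant bounds the number of blank slots skipped by O(1)."""
    i = last + 1
    while i < len(arr) and arr[i] is None:
        i += 1
    return i if i < len(arr) else None

def scan_backward(arr, last):
    """Index of the previous non-blank entry before `last`."""
    i = last - 1
    while i >= 0 and arr[i] is None:
        i -= 1
    return i if i >= 0 else None
```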
Scan Forward/Backward Operation
Theorem 2. A sequence of k scan operations
uses O(k / B + 1) block transfers.
Proof: A sequence of k scan operations only
accesses O(k) consecutive elements in the
“array” in order.
Insert/Delete Operation
• Insert and delete operations proceed in the
same manner – we describe only insert
• Insert x in three stages:
– Perform Predecessor(x) to find the location to
insert
– Insert x into the “array” using packed-memory
insertion algorithm
– Update key values in the “tree”
Insert/Delete Operation
• The first two steps are straightforward
• Update the key values in the “tree”:
– Copy the updated key from “array” to “tree” into the
corresponding locations
– Update all the ancestors of the updated leaves in a post-order traversal
– Change key values of the node to reflect the
maximum of key values of the children
– For a given node, the values of its children have
already been updated (because of the post-order traversal)
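The post-order update of ancestors can be sketched as follows. A minimal sketch over a dict-based tree (my own representation, not the paper's array encoding); blank keys are `None`:

```python
def update_keys(node):
    """Post-order pass: visit children first, then set the node's key to
    the maximum non-blank key among its children."""
    children = node.get("children", [])
    if children:
        for c in children:
            update_keys(c)                  # children are finished first
        keys = [c["key"] for c in children if c["key"] is not None]
        node["key"] = max(keys) if keys else None
```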
Insert/Delete Operation
Lemma 1. A post-order traversal of k leaves in the tree
and their ancestors requires O(log N + k / B) block transfers.

Proof: Consider the horizontal stripes formed by
the sub-trees that are smaller than B. We
pass w = O(log_B N) stripes on any root-to-leaf
path. Number the stripes from the bottom
of the tree, starting at 1.
Insert/Delete Operation
| A | B
| B1 | B
| Bk | B
B1
20 July 2015
Bk
Presented by Gregory Giguashvili
stripes
A
w
w-1
w-2
…
2
1
31
Insert/Delete Operation
Proof (cont.): All items in any stripe are accessible
in two memory transfers (MT).
We analyze the cost of accessing one of the stripe-2
trees and all of its stripe-1 tree children. The sizes
of these trees are in the range √B … B. Since all the
stripe-2 tree's children are stored consecutively, a
post-order access of k items causes 1 + 2k / B
page faults (provided memory can hold 2 blocks).
Insert/Delete Operation
Proof (cont.): We might need one more memory
transfer for interleaved accesses. Thus, 2 + 2k / B
block transfers are performed (provided memory
can hold 4 blocks).
In general, an access of k consecutive items in post-order
mostly touches stripe-1 and stripe-2 sub-trees.
At most log N + k / B items are
accessed at stripes 3 and higher. Each of these
accesses causes one memory transfer (provided
memory can hold 5 blocks).
Insert/Delete Operation
Proof (cont.): We end up with at most
2k / B + log N + 3 block transfers to access k
consecutive items of the tree in post-order,
given a cache of 5 blocks, i.e. O(log N + k / B).
Insert/Delete Operation
Theorem 3. The number of block transfers caused
by the Insert(x) and Delete(x) operations is
O(log_B N + log² N / B).

Proof: The predecessor query costs O(log_B N).
Insertion into the packed-memory array costs
O(1 + log² N / B) amortized. Let w be the number of
items changed in the "array". Updating the internal
nodes of the "tree" uses O(log N + w / B) memory
transfers (Lemma 1).
Insert/Delete Operation
Proof (cont.): Since O(w / B + 1) is asymptotically
the same as the number of block transfers
during insertion into the "array", it may be
restated as the O(1 + log² N / B) amortized cost of
insertion into the packed-memory
structure. Therefore, the operation takes:

O(log_B N) + O(1 + log² N / B) = O(log_B N + log² N / B)
Simulation Results
• How do the block size B and cache size M
affect the DS compared to a B-tree?
• Up to 1,000,000 elements inserted
• Each element is 32-bit integer
• If array becomes too full, recopy elements
to a larger array
Simulation Input Patterns
• Insert at head
– Insert in the beginning of the array
– Worst case behavior
• Random insert
– Insert before a random element in the array
– Assign random 32-bit key to new element
• Bulk insert
– Insert a sequence of elements at random position
Experiments Description
• Packed-memory structure
– Amortized number of moved elements per insertion
– Different thresholds and densities considered
• Memory simulator
– Model blocks in memory and on disk
– LRU strategy for block replacement
• Measure page faults by packed-memory and
index structures separately
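The memory simulator described above can be sketched as a small LRU page-fault counter. This is an illustrative sketch, not the authors' simulator; the class name and parameters are my own:

```python
from collections import OrderedDict

class LRUCacheSim:
    """Toy memory simulator: a cache of `num_blocks` blocks of
    `block_size` words each, LRU replacement, counting page faults."""
    def __init__(self, num_blocks, block_size):
        self.capacity = num_blocks
        self.block_size = block_size
        self.cache = OrderedDict()       # block id -> None, in LRU order
        self.faults = 0

    def access(self, addr):
        blk = addr // self.block_size
        if blk in self.cache:
            self.cache.move_to_end(blk)  # refresh recency on a hit
        else:
            self.faults += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least-recently used
            self.cache[blk] = None
```

Feeding it the address traces of the packed-memory structure and the index separately gives the two fault counts the experiments report.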
Scans and Moves
• Count number of elements moved during
rebalance operation
• Density parameters of 50% up to 90%
• Average number of moves during head insertions
~350 per 1,000,000 elements
• Random insertions have a small constant
overhead
• Number of moves increases with the bulk size
Page Faults
• Packed-memory structure (insert at head)
– More sensitive to B than to N
– Almost linear dependence on size of B
– Gets worse if rebalance window is less than cache size
• Packed-memory structure (random insert)
– Close to the worst case (1 fault per insert)
• Packed-memory structure (bulk insert)
– Insert cost is amortized by the number of elements
inserted
Page Faults
• Entire structure (searching index+scanning the
array)
– Consider bulk sizes 1,10,100,…,1000000
– Worst case is the random inserts
– Increasing bulk size gives order of magnitude
performance increase
– Larger bulk sizes hurt performance while rebalancing
the windows
– Performance is never worse than random insertions
B-tree Simulations
• Data entries are 32-bit integers
• Each node contains at most B data entries and B+1
branches
• Tested with different B and bulk sizes
– Insert at head is the best case
– Keeping B close to page size improves performance
– Searching a random key in tree of 1,000,000 elements
with page size of 1024 bytes, gives average page fault
rate of 3.77 (against 3.69 in CODS).
Conclusions
• Advantages over standard B-tree
– Cache-oblivious
– Locality preserving
– Two static arrays
• Worst-case performance of CODS is as good
as that of a B-tree for typical B and M
• Optimized across multiple levels of the memory
hierarchy