Transcript B-tree

External Memory Data Structures
Srinivasa Rao Satti
Workshop on Recent Advances in Data Structures
December 20, 2011
Fundamental Algorithmic Problems
• Searching: Given a list (sequence) L of elements x1, x2, ..., xn
and query element x, check whether x is present in L.
– When L is not sorted, we use linear search – scan the list to
check if x is present in it.
– When L is sorted, we use binary search – divide the remaining
list to be searched in half with every comparison.
Also insert and delete elements to/from L.
• Sorting: Given a sequence of elements, sort them in increasing (or
decreasing) order.
– Insertion sort, bubble sort, quick sort, merge sort
2
Random Access Machine (RAM) Model
• Standard theoretical model of computation:
– Infinite memory
– Uniform access cost
[Figure: the RAM model — CPU with a single uniform-access memory]
• Unit-cost RAM model: All the basic operations (reading/writing a
location from/to the memory, standard arithmetic and Boolean
operations) take one unit of time.
• Simple model crucial for success of computer industry.
3
Hierarchical Memory
[Figure: memory hierarchy with L1 cache, L2 cache, and RAM]
• Modern machines have complicated memory hierarchy
– Levels get larger and slower further away from CPU
– Data moved between levels using large blocks
4
Hard disk drive
5
Slow I/O
• Disk access is 10^6 times slower than main memory access
[Figure: hard disk internals — track, read/write head, read/write arm, magnetic surface]
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
– Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8–16 KB)
• Important to store/access data so as to take advantage of blocks (locality)
6
External Memory Model
[Figure: external memory model — processor P and main memory M, connected to disk D by block I/O]
[Aggarwal-Vitter 1988]
N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
T = # of items in output
I/O: Move block between memory and disk
Performance measures:
Space: # of disk blocks used by the structure
Time: # of I/Os performed by the algorithm
(CPU time is “free”)
8
Scalability Problems: Block Access Matters
• Example: Traversing linked list
– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)
1 5 2 6 3 8 9 4 7 10
Algorithm 1: N=10 I/Os
1 2 10 9 5 6 3 4 8 7
Algorithm 2: N/B=5 I/Os
• Large difference between N and N/B since block size is large
– Example: N = 256 × 10^6, B = 8000, 1 ms disk access time
⇒ N I/Os take 256 × 10^3 sec ≈ 4266 min ≈ 71 hr
⇒ N/B I/Os take 256/8 sec = 32 sec
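The arithmetic in this example can be checked directly; a quick sketch using the values stated above:

```python
# Check the slide's arithmetic: 1 ms per disk access, as stated above.
N = 256 * 10**6    # items
B = 8000           # items per block
access_sec = 1e-3  # 1 ms disk access time

n_ios_sec = N * access_sec          # one I/O per item
nb_ios_sec = (N // B) * access_sec  # one I/O per block

print(n_ios_sec / 3600)  # ~71.1 hours
print(nb_ios_sec)        # 32.0 seconds
```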
9
Queues and Stacks
• Queue:
– Maintain a push block and a pop block in main memory
[Figure: queue with a push block at the tail and a pop block at the head]
⇒ O(1/B) I/Os amortized per Push/Pop operation
• Stack:
– Maintain push/pop blocks in main memory
⇒ O(1/B) I/Os amortized per Push/Pop operation
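A minimal sketch of the external-stack idea: keep up to 2B top elements in a memory buffer, flush B elements as one block on overflow, read one block back on underflow, so each block I/O pays for B operations. Class name and details are my own illustration, not code from the lecture.

```python
# Sketch of an external-memory stack (illustrative, not the lecture's code).
# Each block I/O moves B elements => O(1/B) I/Os amortized per operation.
class ExternalStack:
    def __init__(self, B):
        self.B = B
        self.buffer = []   # in-memory top of the stack (<= 2B elements)
        self.disk = []     # list of blocks, each holding B elements
        self.ios = 0       # number of block I/Os performed

    def push(self, x):
        self.buffer.append(x)
        if len(self.buffer) > 2 * self.B:           # buffer overflows:
            self.disk.append(self.buffer[:self.B])  # flush bottom B as one block
            self.buffer = self.buffer[self.B:]
            self.ios += 1

    def pop(self):
        if not self.buffer and self.disk:           # buffer underflows:
            self.buffer = self.disk.pop()           # read one block back
            self.ios += 1
        return self.buffer.pop()

stack = ExternalStack(B=2)
for i in range(10):
    stack.push(i)
out = [stack.pop() for _ in range(10)]
print(out)        # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(stack.ios)  # 6 block I/Os for 20 operations
```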
10
Fundamental Bounds
               Internal        External
• Scanning:    N               N/B
• Sorting:     N log N         (N/B) log_{M/B}(N/B)
• Searching:   log_2 N         log_B N
• Note:
– Linear I/O: O(N/B)
– B factor VERY important: (N/B) log_{M/B}(N/B) ≪ N log_B N ≪ N log N
– Cannot sort optimally with a search tree
11
Search trees: API
• Given a set S of keys, support the operations:
– search(x) : return TRUE if x is in S, and FALSE otherwise
– insert(x) : insert x into S (error if x is already in S)
– delete(x) : delete x from S (error if x is not in S)
– rangesearch(x,y) : return all the keys z such that x ≤ z ≤ y
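As a point of reference, the API above can be sketched in internal memory with a sorted array and binary search (an illustrative sketch; the class name is mine):

```python
# Internal-memory sketch of the search-tree API, backed by a sorted list.
import bisect

class SortedSet:
    def __init__(self):
        self.keys = []  # kept in sorted order

    def search(self, x):
        i = bisect.bisect_left(self.keys, x)
        return i < len(self.keys) and self.keys[i] == x

    def insert(self, x):
        if self.search(x):
            raise ValueError("key already present")  # error per the API
        bisect.insort(self.keys, x)

    def delete(self, x):
        i = bisect.bisect_left(self.keys, x)
        if i == len(self.keys) or self.keys[i] != x:
            raise ValueError("key not present")      # error per the API
        del self.keys[i]

    def rangesearch(self, x, y):
        lo = bisect.bisect_left(self.keys, x)
        hi = bisect.bisect_right(self.keys, y)
        return self.keys[lo:hi]  # all z with x <= z <= y

s = SortedSet()
for k in [5, 1, 9, 3, 7]:
    s.insert(k)
print(s.search(3))          # True
print(s.rangesearch(2, 8))  # [3, 5, 7]
```

Here rangesearch reports T keys in O(log N + T) time; the external structures in the following slides aim for O(log_B N + T/B) I/Os instead.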
14
Binary Search Trees
• Binary search tree:
– Standard method for searching among N elements
– We assume elements are stored in the leaves
[Figure: binary search tree of height Θ(log_2 N)]
– Search traces a root-to-leaf path
– If nodes are stored arbitrarily on disk:
⇒ Search in O(log_2 N) I/Os
⇒ Rangesearch in O(log_2 N + T) I/Os
15
External Search Trees
[Figure: BFS blocking of a binary tree — each disk block stores a subtree of height Θ(log_2 B), i.e. fan-out Θ(B)]
• BFS blocking:
– Block height: O(log_2 N) / O(log_2 B) = O(log_B N)
– Output elements blocked
⇒ Rangesearch in O(log_B N + T/B) I/Os
• Optimal: O(N/B) space and O(log_B N + T/B) query
16
External Search Trees
• Maintaining BFS blocking during updates?
– Balance is normally maintained in search trees using rotations
[Figure: rotation exchanging nodes x and y]
• Seems very difficult to maintain BFS blocking during rotations
– Also need to make sure output (leaves) is blocked!
17
B-trees
• BFS blocking naturally corresponds to a tree with fan-out Θ(B)
• B-trees balanced by allowing node degree to vary
– Rebalancing performed by splitting and merging nodes
18
(a,b)-tree
• T is an (a,b)-tree (a ≥ 2 and b ≥ 2a−1)
– All leaves are on the same level and contain between a and b elements
– Except for the root, all nodes have degree between a and b
– Root has degree between 2 and b
[Figure: a (2,4)-tree]
• An (a,b)-tree uses linear space and has height O(log_a N)
⇒ Choosing a,b = Θ(B): each node/leaf stored in one disk block
⇒ O(N/B) space and O(log_B N + T/B) query
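To get a feel for the height bound O(log_B N), a quick calculation with illustrative values of N and B (my own example numbers):

```python
import math

# Height comparison: binary tree vs fan-out-B tree (illustrative values).
N = 10**9   # one billion keys
B = 1024    # keys per disk block

binary_height = math.ceil(math.log(N, 2))  # root-to-leaf comparisons
btree_height = math.ceil(math.log(N, B))   # root-to-leaf block reads

print(binary_height)  # 30
print(btree_height)   # 3
```

So a search touches about 3 blocks on disk instead of about 30.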
19
(a,b)-Tree Insert
• Insert:
  Search and insert element in leaf v
  DO { if v has b+1 elements/children
         Split v:
           make nodes v' and v'' with
           ⌈(b+1)/2⌉ ≤ b and ⌊(b+1)/2⌋ ≥ a elements
           insert element (ref) in parent(v)
           (make new root if necessary)
       v = parent(v) }
[Figure: node v with b+1 elements split into v' with ⌈(b+1)/2⌉ and v'' with ⌊(b+1)/2⌋ elements]
• Insert touches O(log_a N) nodes
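The split arithmetic can be checked for all legal (a,b) pairs; a small sketch (function name is mine):

```python
# Arithmetic behind the split: an overfull node with b+1 elements splits
# into ceil((b+1)/2) and floor((b+1)/2); b >= 2a-1 guarantees both halves
# are valid (a,b)-tree nodes.
def split_sizes(b):
    return (b + 2) // 2, (b + 1) // 2   # ceil and floor of (b+1)/2

for a in range(2, 8):
    for b in range(2 * a - 1, 4 * a):   # every legal (a,b) combination
        big, small = split_sizes(b)
        assert big + small == b + 1     # no element is lost
        assert a <= small <= big <= b   # both halves satisfy the bounds

print(split_sizes(4))  # (3, 2): a (2,4)-tree node with 5 elements splits 3/2
```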
20
(2,4)-Tree Insert
21
(a,b)-Tree Delete
• Delete:
  Search and delete element from leaf v
  DO { if v has a−1 elements/children
         Fuse v with sibling v':
           move children of v' to v
           delete element (ref) from parent(v)
           (delete root if necessary)
         If v now has > b (and ≤ a+b−1 < 2b) children, split v
       v = parent(v) }
[Figure: node v with a−1 children fused with sibling v', giving a node with ≥ 2a−1 children]
• Delete touches O(log_a N) nodes
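The degree bookkeeping in delete can be checked the same way; a small sketch of fuse-then-split (function name and structure are my own illustration, not the lecture's code):

```python
# Fuse an underfull node having a-1 children with a sibling, then split
# again if the fused node exceeds b children.
def fuse_then_maybe_split(a, b, sibling_degree):
    fused = (a - 1) + sibling_degree        # between 2a-1 and a+b-1
    if fused <= b:
        return (fused,)
    # fused > b, but fused <= a+b-1 < 2b, so both halves land in [a, b]
    return (fused - fused // 2, fused // 2)

a, b = 2, 4
for sib in range(a, b + 1):                 # every legal sibling degree
    parts = fuse_then_maybe_split(a, b, sib)
    assert sum(parts) == (a - 1) + sib      # no child is lost
    assert all(a <= p <= b for p in parts)  # every resulting node is valid

print(fuse_then_maybe_split(2, 4, 4))  # (3, 2): fused node of 5 > b is split
```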
22
(2,4)-Tree Delete
23
Summary/Conclusion: B-tree
• B-trees: (a,b)-trees with a,b = Θ(B)
– O(N/B) space
– O(logB N+T/B) I/Os for search and rangesearch
– O(logB N) I/Os for insert and delete
• B-trees with elements in the leaves sometimes called B+-tree
• Construction in O((N/B) log_{M/B}(N/B)) I/Os
– Sort elements and construct leaves
– Build tree level-by-level bottom-up
24
B-tree Construction
• In internal memory we can sort N elements in O(N log N) time using a balanced search tree:
– Insert all elements one-by-one (constructing the tree)
– Output in sorted order using an in-order traversal
• The same algorithm using a B-tree uses O(N log_B N) I/Os
– A factor of O(B · log(M/B) / log B) from optimal
• As discussed, we can build a B-tree bottom-up in O((N/B) log_{M/B}(N/B)) I/Os
– In general we would like a dynamic data structure usable in O((N/B) log_{M/B}(N/B)) algorithms ⇒ O((1/B) log_{M/B}(N/B)) I/Os per operation
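A rough comparison of the two construction costs, with illustrative parameter values of my own choosing:

```python
import math

# One-by-one insertion costs about N log_B N I/Os, while bottom-up
# construction via sorting costs about (N/B) log_{M/B}(N/B) I/Os.
N = 10**9   # items
B = 10**3   # items per block
M = 10**6   # items fitting in main memory

one_by_one = N * math.log(N, B)               # ~3 * 10^9 I/Os
bottom_up = (N / B) * math.log(N / B, M / B)  # ~2 * 10^6 I/Os

print(round(one_by_one / bottom_up))  # 1500: one-by-one is ~1500x worse
```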
25
Flash memory
• Non-volatile memory which can be erased and reprogrammed
• Characteristics, compared to magnetic disks:
– Lighter
– Better shock resistance
– Higher throughput
– Lower power consumption
– Denser (uses less space)
• Commonly used in digital cameras, handheld computers, mobile phones, portable music players, etc.
• Also used in embedded systems and sensor networks, and is even replacing magnetic disks in PCs.
32
HDD vs SSD
The disassembled components of a hard disk drive (left)
and of the PCB and components of a solid-state drive (right)
33
Limitations of flash memory
• Memory cells in a flash memory device can be written only a
limited number of times
– between 10,000 and 1,000,000, after which they wear out and
become unreliable.
• The only way to set bits (change their value from 0 to 1) is to erase an entire region of memory. These regions have a fixed size in a given device, typically ranging from several kilobytes to hundreds of kilobytes, and are called erase units.
• Two different types of Flash memories: NOR and NAND
– they have slightly different characteristics
34
Flash memory
• The memory space of the chip is partitioned into blocks called erase
blocks. The only way to change a bit from 0 to 1 is to erase the
entire unit containing the bit.
• Each block is further partitioned into pages, which usually store
2048 bytes of data and 64 bytes of meta-data. Erase blocks typically
contain 32 or 64 pages.
• Bits are changed from 1 to 0 by programming (writing) data onto a
page. An erased page can be programmed only a small number of
times (1 to 3) before it must be erased again.
35
Flash memory
• Reading data takes tens of microseconds for the first access to a
page, plus tens of nanoseconds per byte.
• Writing a page takes hundreds of microseconds, plus tens of
nanoseconds per byte.
• Erasing a block takes several milliseconds.
• Each block can sustain only a limited number of erasures.
Algorithms/data structures designed for I/O model do not always
work well when implemented on flash memory.
36
Flash memory models (I)
• General flash model:
[Figure: general flash model — main memory M and flash, with read block size BR, write block size BW, and write penalty c]
• The complexity of an algorithm is x + c · y, where x and y are the
number of read and write I/Os respectively, and c is a penalty factor
for writing.
• Typically, we assume that BR < BW << M, and c ≥ 1.
37
Flash memory models (II)
• Unit-cost flash model:
[Figure: unit-cost flash model — main memory M and flash, with read block size BR and write block size BW]
• General flash model augmented with the assumption of an equal
access time per element for reading and writing.
• The cost of an algorithm performing x read I/Os and y write I/Os is given by x·BR + y·BW.
• This simplifies the model considerably, as it becomes easier to
adapt external-memory results.
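The two cost models can be written down side by side as simple formulas (parameter values below are arbitrary illustrations, not measurements):

```python
# The two flash cost models from the slides, as formulas.
def general_flash_cost(x, y, c):
    # General flash model: x read I/Os, y write I/Os, write penalty c >= 1
    return x + c * y

def unit_cost_flash(x, y, BR, BW):
    # Unit-cost flash model: equal per-element access cost for reads/writes
    return x * BR + y * BW

print(general_flash_cost(100, 10, c=5))           # 150
print(unit_cost_flash(100, 10, BR=512, BW=2048))  # 71680
```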
38
B-trees on flash memory
• An insertion in a B-tree updates a single leaf (unless the leaf splits)
• Since we cannot perform an in-place update in flash memory, we
need to create a new copy of the leaf, with the new element inserted.
• Since the parent of this leaf has to update its pointer to the leaf, we need to create a new copy of the parent as well, and so on up to the root.
• Thus the write performance is quite bad for the naïve
implementation.
39
Flash Translation Layer (FTL)
• Software layer on the flash disk which performs logical to physical
block mapping.
• Distributes writes uniformly across blocks.
• B-tree with FTL:
– All nodes contain just the logical address of other nodes
– Allows any update to write just the target node
• Achieves one erase per update (amortized)
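A toy sketch of the logical-to-physical indirection an FTL provides (heavily simplified: no erasure, garbage collection, or wear leveling; class and field names are my own):

```python
import itertools

# Updates never overwrite in place: each write goes to a fresh physical
# page and the logical address is redirected, so a B-tree node holding
# logical addresses of its children need not be rewritten when a child moves.
class SimpleFTL:
    def __init__(self):
        self.mapping = {}                   # logical page -> physical page
        self.next_free = itertools.count()  # pretend-infinite fresh pages
        self.flash = {}                     # physical page -> contents

    def write(self, logical, data):
        phys = next(self.next_free)   # program a fresh (erased) page
        self.flash[phys] = data
        self.mapping[logical] = phys  # old physical page becomes garbage

    def read(self, logical):
        return self.flash[self.mapping[logical]]

ftl = SimpleFTL()
ftl.write(7, "leaf v1")
ftl.write(7, "leaf v2")  # update: new physical page, same logical address
print(ftl.read(7))       # leaf v2
print(ftl.mapping[7])    # 1 (the second physical page)
```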
40
μ-tree
[Kang, Jung, Kang, Kim, 2007]
• Minimally Updated tree
• Achieves similar performance as ‘B-tree with FTL’ on raw flash
• Sizes of the nodes decrease exponentially from the leaf level to the root
• Each block corresponds to a leaf-to-root path, and stores the nodes on a prefix of this path
• Works only when log_2 B ≥ log_B N
41
FD-tree
[Li, He, Yang, Luo, Yi, 2010]
• Flash Disk aware tree index
• Transforms random writes into sequential writes
• Limits random writes to within a small region
42
FD-tree
• Contains a head tree and a few levels of sorted runs of increasing sizes
• O(log_k N) levels, where k is the size ratio between levels
43
Other B-tree indexes for flash memory
• BFTL [Wu, Luo, Chang, 2007]
• Lazy Adaptive tree [Agrawal, Ganesan, Sitaraman, Diao, Singh,
2009]
• Lazy Update tree [On, Hu, Li, Xu, 2009]
• In-page Logging approach [Lee, Moon, 2007]
• …
All these are designed to get better practical performance, and take
different aspects of flash characteristics into consideration.
– Not easy to compare with each other
44
Comparison of tree indexes on flash
N – number of elements
BR – read block size
BW – write block size
BU – size of buffer
h – height of the tree
k – parameter
45
Directions for further research
• The area is still in its infancy.
• Not much work has been done apart from the development of some file systems and tree indexing structures
• Open problems:
– Efficient tree indexes for flash memory
– Tons of other (practically significant) algorithmic problems
– Better memory model.
46
Thank You
47