
An Approach to Generalized Hashing
Michael Klipper
with Dan Blandford and Guy Blelloch
Hashing techniques currently
available
 Many hashing algorithms out there:
 Separate chaining
 Cuckoo hashing
 FKS perfect hashing
 Many hash functions have also been
designed, including several universal families
 Good: O(1) expected amortized time for
updates, and many have O(1) worst case
time for searches
 Bad: Require fixed-length keys and fixed-length data
What’s so bad about fixed length?
 Easy to waste a lot of space:
 Every hash bucket must be as large as
the largest item to be stored in the table.
 This is a large problem for sparsely-filled
tables, or tables where large items occur
infrequently.
 Hash tables are often building blocks
to more complicated structures, so
optimizing them pays off in a lot of
places.
Example: A Graph Layout Where
We Store Edges in a Hash Table
Let’s say u is a vertex of degree d and v_1, …, v_d are its
neighbors. Let’s say that v_0 = v_{d+1} = u by convention. Then
the entry representing the edge (u, v_i) has key (u, v_i) and
data (v_{i-1}, v_{i+1}).
[Figure: a hash table holding the edge entries (u, v_1) … (u, v_4); an extra entry “starts” the list and records the degree of the vertex, here 4.]
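As a sketch of this layout, with an ordinary Python dict standing in for the hash table, the scheme might look like the following. The choice to key the extra starter entry as (u, u) and have it store the degree and the first neighbor is our own assumption about the slide's figure, not something the transcript pins down:

```python
# Sketch (assumed details): each vertex's adjacency list is stored as a
# doubly linked list inside one dict standing in for the hash table.
# The entry for edge (u, v_i) has key (u, v_i) and data (v_{i-1}, v_{i+1}),
# with v_0 = v_{d+1} = u by the slides' convention. We assume the extra
# "starter" entry is keyed (u, u) and records the degree and first neighbor.

def build_edge_table(adjacency):
    """adjacency: dict mapping vertex u -> list of neighbors [v_1, ..., v_d]."""
    table = {}
    for u, nbrs in adjacency.items():
        d = len(nbrs)
        table[(u, u)] = (d, nbrs[0] if d else u)       # starter entry (assumed layout)
        for i, v in enumerate(nbrs):
            prev_v = nbrs[i - 1] if i > 0 else u        # v_0 = u
            next_v = nbrs[i + 1] if i + 1 < d else u    # v_{d+1} = u
            table[(u, v)] = (prev_v, next_v)
    return table

def neighbors(table, u):
    """Walk u's linked list of edges in O(degree) hash-table lookups."""
    out = []
    _, v = table[(u, u)]
    while v != u:
        out.append(v)
        _, v = table[(u, v)]
    return out

t = build_edge_table({'u': ['v1', 'v2', 'v3', 'v4']})
```

Walking the list recovers the neighbors in order, so deleting or inserting an edge only touches its two linked-list neighbors.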
An Idea for Compression
 Instead of ((u, v_i), (v_{i-1}, v_{i+1})) in the table,
we will store
((u, v_i − u), (v_{i-1} − u, v_{i+1} − u)).
 With this representation, we need O(k + n)
space where k = Σ_{(u,v)∈E} log |u − v|.
 A good labeling of the vertices will make
many of these differences small! But not
all of them. The following paper has
details:
D. Blandford, G. E. Blelloch, and I. Kash. Compact
Representations of Separable Graphs. In SODA, 2003, pages
342-351.
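To see why the labeling matters, here is a small sanity check of our own (not taken from the paper) comparing the total difference cost k for a consecutive labeling of a path against a scrambled labeling of the same path, using `bit_length` as a stand-in for log |u − v|:

```python
# Our own illustration: the cost k = sum over edges (u, v) of log2 |u - v|
# depends heavily on the vertex labeling.

def diff_cost_bits(edges):
    # bit_length of the difference stands in for log2 |u - v|
    return sum(abs(u - v).bit_length() for (u, v) in edges)

n = 16
path_good = [(i, i + 1) for i in range(n - 1)]            # consecutive labels along a path
perm = [(i * 7) % n for i in range(n)]                    # an arbitrary scrambled labeling
path_bad = [(perm[i], perm[i + 1]) for i in range(n - 1)]
```

With consecutive labels every difference is 1 (one bit each), while the scrambled labeling pays several bits per edge.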
First, a simpler problem
 Variable-length data stored in arrays
 It’s like a hash table except that the
indices now are in the fixed range
0…n-1 for n items in the array.
We’ll use the following data for our example in these slides:
(0, 10110)
(1, 0110)
(2, 11111)
(3, 0101)
(4, 1100)
(5, 010)
(6, 11011)
(7, 00001111)
We’ll assume that the word size of the machine is 2 bytes.
Key Idea: BLOCKS
 Multiple data items can be crammed into a
word, so let’s take advantage of that.
 Two words in a block: one with data, one
marking off separations of strings
 If the first index in a block is i, we’ll label
the block as bi
b0
1st word (data):    10110 0110 11111 (padded to 16 bits)
2nd word (markers): 1000 0100 0100 0010
A 1 in the 2nd word marks where each string starts; the final 1 marks the end of the last string.
This is the block containing strings s0 through s2 from our example.
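A minimal sketch of this two-word layout, assuming w = 16 bits as in the running example, and using Python bit strings rather than machine words for readability:

```python
W = 16  # word size from the slides' example (2 bytes)

def pack_block(strings):
    """Pack strings into a data word and a marker word (both as bit strings)."""
    data, marks, pos = "", ["0"] * W, 0
    for s in strings:
        marks[pos] = "1"      # a 1 marks the start of each string
        data += s
        pos += len(s)
    assert pos < W, "this sketch needs one spare bit for the end marker"
    marks[pos] = "1"          # final 1 marks the end of the last string
    return data.ljust(W, "0"), "".join(marks)

def unpack_string(data, marks, i):
    """Extract string #i by locating the i-th and (i+1)-th marker bits."""
    starts = [p for p, b in enumerate(marks) if b == "1"]
    return data[starts[i]:starts[i + 1]]

# Block b0: strings s0..s2 from the example
data, marks = pack_block(["10110", "0110", "11111"])
```

Running this on s0 through s2 reproduces the two words shown for b0 above.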
Organization of Blocks
 Index structure
(regular array): A[i]
= 1 if and only if
string #i starts a
block
 Hash table (one of
the regular kind): if
string #i starts a
block, H(i) = address
of bi
 Note that it is easy to
split and merge
blocks.
Key size invariant: two adjacent
blocks (like b0 and b3 in the
example) must have their sizes
sum to greater than the word
size of the machine
[Figure: index array A = 1 0 0 1 0 0 0 1; hash table entries H(0), H(3), H(7) hold the addresses of blocks b0, b3, b7.]
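A simplified sketch of our own showing how A and H cooperate to find string #i: here H is a plain dict, a block is a list of strings, and the constant-time word-parallel scan is replaced by a linear loop.

```python
A = [1, 0, 0, 1, 0, 0, 0, 1]                  # A[i] = 1 iff string #i starts a block
H = {0: ["10110", "0110", "11111"],           # b0 holds s0..s2
     3: ["0101", "1100", "010", "11011"],     # b3 holds s3..s6
     7: ["00001111"]}                         # b7 holds s7

def lookup(i):
    j = i
    while A[j] == 0:    # find the block start at or before i
        j -= 1          # (a linear stand-in for the O(1) word-level scan)
    return H[j][i - j]  # string #i is the (i - j)-th string in block b_j
```

Splitting or merging blocks only needs a few bit flips in A and a few hash-table updates, which is why the invariant is cheap to maintain.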
A Rough Look at Space and Time
Bounds for this Array Structure
Let’s say we have n items, and w is the
word size in bits of the machine. WLOG all
data strings are nonempty.
Let m = Σ_i |s_i|.
Lookup tables cut the time down to constant
time for finding a block and the string inside
it, since they can operate on entire words at
once.
Indexing structures + hash table use O(w) bits
per block. Each block is on average half full due
to the invariant.
O(m + w) bits used and operations are O(1) time!
[Figure: the 1s in A marking block starts are at most w apart, so each block holds at most w strings; by the invariant each block is on average half full, giving O(m/w + 1) blocks.]
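One concrete instance of such a lookup table (our own illustration, not taken from the slides): precompute, for every possible w-bit marker word, the positions of its set bits, so the start of the k-th string in a block is found with a single table access.

```python
W = 16  # word size in bits, per the running example

# SELECT[word][k] = position (counting from the left) of the k-th set bit
# of word. Building the table costs O(2**w * w) once; each query is O(1).
SELECT = [[p for p in range(W) if (word >> (W - 1 - p)) & 1]
          for word in range(1 << W)]

def kth_start(marker_word, k):
    """Start position of string #k inside a block, via one table lookup."""
    return SELECT[marker_word][k]

marker = 0b1000010001000010   # the marker word of block b0 in the example
```

In practice one would index the table by half-words or bytes to keep it small, but the constant-time principle is the same.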
Briefly, how we proceed from there
 We can finally implement our
generalized hash table using an array
of the type we just described as the
hash table.
 There are more details: the following
paper explains this.
D. Blandford and G. E. Blelloch. Storing Variable-Length Keys
in Arrays, Sets, and Dictionaries, with Applications. In
Symposium on Discrete Algorithms (SODA), 2005 (hopefully)
Great, but if there’s a paper written on the
subject already, then what do I do?
 A lot of this code isn’t written yet. We
haven’t yet verified that the code we have meets
the theoretical bounds, since we have to make
sure that any corner-cutting done in the
programming is theoretically safe.
 My job is to get a lot of this running and
look for optimizations.
 Also, once this is running, we’ll want to run
experiments to see how well it runs,
especially in modeling graphs.