Hashing Dictionaries

Transcript Hashing Dictionaries

Dictionaries, Tables
Hashing
TCSS 342
1/51
The Dictionary ADT
• a dictionary (table) is an abstract model of a
database
• like a priority queue, a dictionary stores
key-element pairs
• the main operation supported by a
dictionary is searching by key
2/51
Examples
•
•
•
•
Telephone directory
Library catalogue
Books in print: key ISBN
FAT (File Allocation Table)
3/51
Main Issues
• Size
• Operations: search, insert, delete, ??? Create
reports??? List?
• What will be stored in the dictionary?
• How will be items identified?
4/51
The Dictionary ADT
• simple container methods:
– size()
– isEmpty()
– elements()
• query methods:
– findElement(k)
– findAllElements(k)
5/51
The Dictionary ADT
• update methods:
– insertItem(k, e)
– removeElement(k)
– removeAllElements(k)
• special element
– NO_SUCH_KEY, returned by an unsuccessful
search
6/51
Implementing a Dictionary
with a Sequence
• unordered sequence
– searching and removing takes O(n) time
– inserting takes O(1) time
– applications to log files (frequent insertions,
rare searches and removals) 34 14 12 22 18
34
14
12
22
18
7/51
Implementing a Dictionary
with a Sequence
• array-based ordered sequence
(assumes keys can be ordered)
- searching takes O(log n) time (binary search)
- inserting and removing takes O(n) time
- application to look-up tables
(frequent searches, rare insertions and removals)
12
14
18
22
34
8/51
Binary Search
• narrow down the search range in stages
• “high-low” game
• findElement(22)
2
low
4
5
7
8
9 12 14 17 19 22 25 27 28 33 37
mid
high
9/51
Binary Search
2
4
5
7
8
9 12 14 17 19 22 25 27 28 33 37
low
2
4
5
7
8
mid
high
9 12 14 17 19 22 25 27 28 33 37
low mid high
2
4
5
7
8
9 12 14 17 19 22 25 27 28 33 37
low = mid = high
10/51
Pseudocode for Binary Search
Algorithm
BinarySearch(S, k, low, high)
if low > high then
return NO_SUCH_KEY
else
mid (low+high) / 2
if k = key(mid) then
return key(mid)
else if k < key(mid) then
return BinarySearch(S, k, low, mid-1)
else
return BinarySearch(S, k, mid+1, high)
11/51
Running Time of Binary
Search
• The range of candidate items to be searched
is halved after each comparison
Comparison
Search Range
0
n
1
n/2
…
…
2i
n/2i
log 2n
1
12/51
Running Time of Binary
Search
• In the array-based implementation, access
by rank takes O(1) time, thus binary search
runs in O(log n) time
• Binary Search is applicable only to Random
Access structures (Arrays, Vectors…)
13/51
Implementations
• Sorted? Non Sorted?
• Elementary: Arrays, vectors linked lists
– Orgainization: None (log file), Sorted, Hashed
• Advanced: balanced trees
14/51
Skip Lists
• Simulate Binary Search on a linked list.
• Linked list allows easy insertion and
deletion.
• http://www.epaperpress.com/s_man.html
15/51
A FAT Example
• Directory: Key: file name. Data (time, date, size
…) location of first block in the FAT table.
• If first block is in physical location #23 (Disk
block number) look up position #23 in the FAT.
Either shows end of file or has the block number
on disk.
• Example: Directory entry: block # 4
• FAT: x x x F 5 6 10 x 23 25
3
The file occupies blocks 4,5,6,10, 3.
16/51
Hashing
• Place item with key k in position h(k).
• Hope: h(k) is 1-1.
• Requires: unique key (unless multiple items
allowed). Key must be protected from
change (use abstract class that provides only
a constructor).
• Keys must be “comparable”.
17/51
Key class
•
•
•
•
•
•
•
public abstract class KeyID {
Private Comparable searchKey;
Public KeyID(Comparable m) {
searchKey = m;
}//Only one constructor
public Comparable getSearchKey() {
return searchKey;
}
• }
18/51
Hashing Problem
• RT&T is a large phone company, and they
want to provide enhanced caller ID
capability:
–
–
–
–
given a phone number, return the caller’s name
phone numbers are in the range 0 to R = 1010–1
n is the number of phone numbers used
want to do this as efficiently as possible
19/51
Hashing Problem
• We know two ways to design this dictionary:
• abalanced search tree (AVL, red-black) or a skiplist with the phone number as the key has O(log n)
query time and O(n) space --- good space usage
and search time, but can we reduce the search time
to constant?
• abucket array indexed by the phone number has
optimal O(1) query time, but there is a huge
amount of wasted space: O(n + R)
20/51
Bucket Array
• Each cell is thought of as a bucket or a container
– Holds key element pairs
– In array A of size N, an element e with key k is inserted
in A[k].
(null)
(null)
000-000-0000 000-000-0001
…
Roberto
401-863-7639 ...
…
(null)
999-999-9999
21/51
Generalized indexing
• Hash table
– Data storage associated with a key
– The key need not be an integer
22/51
Hash Tables
• A data structure
• The location of an item is determined
– directly as a function of the item itself
– Not by a sequence of trial and error comparisons
• Commonly used to provide faster searching
– O(n) for linear searches
– O (logn) for binary search
– O(1) for hash table
23/51
Example:
• A symbol table constructed by a
compiler
– Stores identifiers and information about
them
24/51
Another Solution
• A Hash Table is an alternative solution with
O(1) expected query time and O(n + N)
space, where N is the size of the table
• Like an array, but with a function to map the
large range of keys into a smaller one
– e.g., take the original key, mod the size of the
table, and use that as an index
25/51
Example
• Insert item (401-863-7639, Roberto) into a table of
size 5
• 4018637639 mod 5 = 4, so item (401-863-7639,
Roberto) is stored in slot 4 of the table
• A lookup uses the same process: map the key to an
index, then check the array cell at that index
401863-7639
Roberto
0
1
2
3
4
26/51
Collision
• Insert (401-863-9350, Andy)
• And insert (401-863-2234, Devin). We have
a collision!
27/51
Collision Resolution
• How to deal with two keys which map to
the same cell of the array?
• Use chaining
– Set up lists of items with the same index
28/51
Chaining
0
1
2
3
4
29/51
Chaining
• The expected, search/insertion/removal time
is O(n/N), provided the indices are
uniformly distributed
• The performance of the data structure can
be fine-tuned by changing the table size N
30/51
Hash Function
• Function h defined by h(i) = i
– Determines the location of an item i in the
hash table
• Called a hash function.
• To reduce the large size of a hash table
use
• h(i) = i mod 25;
31/51
From Keys to Indices
• The mapping of keys to indices of a hash table is
called a hash function
• A hash function is usually the composition of two
maps:
– hash code map: key  integer
– compression map: integer  [0, N - 1]
• An essential requirement of the hash function is to
map equal keys to equal indices
• A “good” hash function minimizes the probability
of collisions
32/51
Java Hash
• Java provides a hashCode() method for the
Object class, which typically returns the 32bit memory address of the object.
• This default hash code would work poorly
for Integer and String objects
• The hashCode() method should be suitably
redefined by classes.
33/51
Popular Hash-Code Maps
• Integer cast: for numeric types with 32 bits
or less, we can reinterpret the bits of the
number as an int
• Component sum: for numeric types with
more than 32 bits (e.g., long and double),
we can add the 32-bit components.
34/51
Popular Hash-Code Maps
• Polynomial accumulation: for strings of a
natural language, combine the character
values (ASCII or Unicode) a 0 a 1 ... a n-1 by
viewing them as the coefficients of a
polynomial: a 0 + a 1 x + ...+ x n-1 a n-1
35/51
Popular Hash-Code Maps
– The polynomial is computed with Horner’s
rule, ignoring overflows, at a fixed value x:
a0 + x (a1 + x (a2 + ... x (an-2 + x an-1 ) ... ))
– The choice x = 33, 37, 39, or 41 gives at most 6
collisions on a vocabulary of 50,000 English
words
• Why is the component-sum hash code bad
for strings?
36/51
Random Hashing
• Random hashing
– Uses a simple random number generation
technique
– Scatters the items “randomly” throughout
the hash table
37/51
Popular Compression Maps
• Division: h(k) = |k| mod N
– the choice N =2 k is bad because not all the bits are
taken into account
– the table size N is usually chosen as a prime
number
– certain patterns in the hash codes are propagated
• Multiply, Add, and Divide (MAD):
– h(k) = |ak + b| mod N
– eliminates patterns provided a mod N 0
– same formula used in linear congruential (pseudo)
random number generators
38/51
More on Collisions
• A key is mapped to an already occupied
table location
– what to do?!?
• Use a collision handling technique
– We’ve seen Chaining
– Can also use Open Addressing
• Double Hashing
• Linear Probing
39/51
Linear Probing
• If the current location is used, try the next
table location
• linear_probing_insert(K)
if (table is full) error
probe = h(K)
while (table[probe] occupied)
probe = (probe + 1) mod M
table[probe] = K
40/51
Linear Probing
• Lookups walk along table until the key or an
empty slot is found
• Uses less memory than chaining
– don’t have to store all those links
• Slower than chaining
– may have to walk along table for a long way
• Deletion is more complex
– either mark the deleted slot
– or fill in the slot by shifting some elements down
41/51
Linear Probing Example
• h(k) = k mod 13
• Insert keys:
• 18 41 22 44 59 32 31 73
0
1
2
3
4
5
41
0
1
2
6
7
8
9
10 11 12
18 44 59 32 22 31 72
3
4
5
6
7
8
9
10 11 12
42/51
Double Hashing
• Use two hash functions
• If M is prime, eventually will examine every
position in the table
• double_hash_insert(K)
if(table is full) error
probe = h1(K)
offset = h2(K)
while (table[probe] occupied)
probe = (probe + offset) mod M
table[probe] = K
43/51
Double Hashing
• Many of same (dis)advantages as linear
probing
• Distributes keys more uniformly than linear
probing does
44/51
Double Hashing Example
• h1(K) = K mod 13
• h2(K) = 8 - K mod 8
– we want h2 to be an offset to add
– 18 41 22 44 59 32 31 73
0
1
44
0
2
3
4
41 73
1
2
3
5
6
7
8
9
10 11 12
18 32 53 31 22
4
5
6
7
8
9
10 11 12
45/51
Hash code
• static int hashCode(long i) {
return (int)((i >> 32) + (int) i);}
46/51
Hash code
• static int hashCode(String s) {
int h=0;
for (int i=0; i<s.length(); i++) {
h = (h << 5) | (h >>> 27);
// 5-bit cyclic shift of the running sum
h += (int) s.charAt(i); // add in next
character }
return h;
}
47/51
Linear Probing Hash Table
public class LinearProbingHashTable implements Dictionary {
/** Marker for deactivated buckets */
private static Item AVAILABLE = new Item(null, null);
/** number of items in the dictionary */
private int n = 0;
/** capacity of the bucket array */
private int N;
/** bucket array */
private Item[] A;
48/51
Linear Probing Hash Table
/** hash comparator */
private HashComparator h;
/** constructor providing the hash comparator */
public LinearProbingHashTable(HashComparator hc) {
h = hc;
N = 1023; // default capacity
A = new Item[N];
}
49/51
Linear Probing Hash Table
/** constructor providing the hash comparator and the
capacity
* of the bucket array */
public LinearProbingHashTable(HashComparator hc, int
bN) {
h = hc;
N = bN;
A = new Item[N];
}
50/51
Linear Probing Hash Table
// auxiliary methods
private boolean available(int i) {
return (A[i] == AVAILABLE);
}
private boolean empty(int i) {
return (A[i] == null);
}
51/51
Linear Probing Hash Table
private Object key(int i) {
return A[i].key();
}
private Object element(int i) {
return A[i].element();
}
private void check(Object k) {
if (!h.isComparable(k)) throw new
InvalidKeyException("Invalid key.");
}
52/51
Helper search method
/** helper search method */
private int findItem(Object key) throws
InvalidKeyException {
check(key);
int i = h.hashValue(key) % N;
// division method compression map
int j = i;
53/51
Helper search method
do {
if (empty(i))
return -1; // item is not found
if (available(i))
i = (i + 1) % N; // bucket is deactivated
else if (h.isEqualTo(key(i), key)) // we have found our item
return i;
else // we must keep looking
i = (i + 1) % N;
} while (i != j);
return -1; // item is not found
}
54/51
Dictionary
// methods of the dictionary ADT
public Object findElement (Object key) throws
InvalidKeyException {
int i = findItem(key); // helper method for finding a key
if (i < 0)
return Dictionary.NO_SUCH_KEY;
return element(i);
}
55/51
Dictionary
public void insertItem (Object key, Object element)
throws InvalidKeyException {
check(key);
int i = h.
hashValue(key) % N; // division method compression
map
int j = i;
56/51
Dictionary
// remember where we are starting
do {
if (empty(i) || available(i)) { // this slot is available
A[i] = new Item(key, element);
n++;
return;
}
i = (i + 1) % N; // check next slot
} while (i != j); // repeat until we return to start
throw new HashTableFullException("Hash table is
full."); }
57/51
Dictionary
public Object removeElement (Object key) throws
InvalidKeyException {
int i = findItem(key); // find this key first
if (i <0)
return Dictionary.NO_SUCH_KEY;
// nothing to remove
Object toReturn = element(i);
A[i] = AVAILABLE; // mark this slot as deactivated
n--;
return toReturn;
}
58/51