Transcript empty table

CS121 Data Structures
Tables
An abstract table, T, contains table entries that are either
•empty, or
•pairs of the form (K, I) where K is a key and I is some
information associated with key K.
CS121 © JAS 2004
CS121 Data Structures
The operations we may require are:
1. Initialise the table, T, to be the empty table.
The empty table can be seen as filled with entries (Ko, Io) where Ko is a
special empty key, distinct from all other nonempty keys.
2. Determine whether or not the table, T, is full.
3. Insert a new entry (K,I), into the table, T, provided T is not
already full.
4. Delete the table entry (K,I) from the table T.
5. Given, K, retrieve the information, I, from the entry (K,I)
6. Update the table entry (K,I) in table, T, by replacing it with a
new table entry (K,I')
7. Enumerate the table entries (K,I) in table, T, in increasing order
of their keys, K.
CS121 © JAS 2004
CS121 Data Structures
Implementing a Table using a (Binary Search) Tree
•
Initialise – create empty tree
•
Empty – test for empty tree (root is nil)
•
Full – tree is full only when memory exhausted
•
Insertion – add entry to tree
•
Deletion – delete a node (and reshape tree)
•
Retrieve – search tree
•
Update – search for entry and modify it
•
Enumerate – traverse tree in appropriate order
Initialise, Empty – O(1)
Insert, Delete, Retrieve, Update – O(logn)
Enumerate – O(n)
CS121 © JAS 2004
CS121 Data Structures
Hash Tables
Principle – calculate a value (hash value) from the key to
determine the position of the entry in the table
If the key field is alphabetic a simple hash function is to
associate an integer value with each letter; sum them and
return a value modulo the table size
CS121 © JAS 2004
CS121 Data Structures
• Associate the value 1 with a, 2 with b, and so on up to 26
with z
• Let the table size be 10 (indexes 0..9)
cat
dog
test
cat
3+1+20 = 24
mod 10 = 4
dog
4+15+7 = 26
mod 10 = 6
CS121 © JAS 2004
test
20+5+19+20 = 64
mod 10 = 4
CS121 Data Structures
• Unless our table size is such that it is big enough to contain
all possible entries we will get collisions
CS121 © JAS 2004
CS121 Data Structures
Probability of Collisions
von Mises Birthday Paradox
• for 23 or more people, there is a greater than 50% chance
that two or more will have the same birthday.
• for 88 or more – three or more will have the same birthday.
• Hence for a 365 entry table (based on birthdays), after 23
insertions there will be a greater than 50% chance that an
insertion will result in a collision.
• This is counter-intuitive because the table is in fact only
23/365 = 6.3% full.
• This is known as the load factor.
CS121 © JAS 2004
CS121 Data Structures
The Theory
For M people – probability same birthday P(M)
– probability they don’t Q(M).
Q(M) = 1 – P(M).
For M = 1
Q(1) = 1 (nobody to share with!)
For M = 2
Q(2) = 364/365
(only 364 days not the other person’s birthday)
For M = 3
Q(3) = (364x363) / (3652)
For M > 1
Q(M) = (364x353x…x(366-M)) / (365 (M-1))
P(22) = 0.476
P(23) = 0.507
CS121 © JAS 2004
CS121 Data Structures
• When a collision occurs we need to find an alternative
space for the entry
• The simplest method is a linear search – look forward for
the next empty space – open addressing
CS121 © JAS 2004
CS121 Data Structures
Primary clustering
CS121 © JAS 2004
CS121 Data Structures
Solution to clustering is to use a different search path for
different key values
- double hashing
If the initial hash function is based on remainder then a
typical second hash would be based on the quotient
if a <> b and a mod t = b mod t
then a div t <> b div t
CS121 © JAS 2004
CS121 Data Structures
Recall our initial example and let the rehash be divide by table
size
test
cat
dog
test
cat
3+1+20 = 24
mod 10 = 4
dog
4+15+7 = 26
mod 10 = 6
CS121 © JAS 2004
test
20+5+19+20 = 64
mod 10 = 4
div 10 = 6
CS121 Data Structures
The same principles of hashing and collision resolution are
used for both insertion and retrieval/update.
The series of entries that are examined to see if they are
empty is called the probe sequence.
Insertion terminates when we find an empty entry to insert
into.
Retrieval/update terminates when either we find the entry we
are looking for, or we have searched the whole table.
CS121 © JAS 2004
CS121 Data Structures
Need to ensure that all possible entries are probed.
Obvious for a linear probe sequence (increment by 1).
For a double hash probe sequence it is less obvious unless the table
size is a prime number (or table is even and probe step size is
odd).
To check whole table has been searched we can count the number of
probes.
Alternative is define a full table as one with one empty space. A
search thus ends when an empty space has been found.
This requires the insertion routine to know that it doesn’t fill the last
space.
CS121 © JAS 2004
CS121 Data Structures
Some code:Hashing by Open Addressing
public class OpenAddress
{ private Object[] Elt;
private int Tablesize;
private Object EmptyCell;
CS121 © JAS 2004
CS121 Data Structures
OpenAddress(int tablesize)
{ int index;
Tablesize = tablesize;
EmptyCell = null;
Elt = new Object[tablesize];
for (index=0;index<tablesize;++index)
Elt[index] = EmptyCell;
}
CS121 © JAS 2004
CS121 Data Structures
public void Store (Object newElt)
throws OAException
{ int index = Hash (newElt.key);
probe = Hash2 (newElt.key);
for(int cnt = 1; cnt < Tablesize; cnt++)
{ if (Elt[index].equals(EmptyCell))
{Elt[index] = newElt; return;}
else if (Elt[index].key.equals(newElt.key))
throw new OAException(“already in”);
else index = (index+probe)%Tablesize;
} throw new OAException(“table full”);
}
CS121 © JAS 2004
CS121 Data Structures
public Object Retrieve(Object SearchKey)
throws OAException
{ int index = Hash(Searchkey);
probe = Hash2(Searchkey);
for (int cnt = 1; cnt < Tablesize; cnt++0
{ if (Elt[index].equals(EmptyCell))
throw new OAException(“not in table”);
else if (Elt[index].key.equals(SearchKey))
return Elt[index];
else index = (index + probe)%Tablesize;
} throw new OAException(“not in table”);
}
CS121 © JAS 2004
CS121 Data Structures
Alternative approach is to use chaining
cat
test
cat
dog
dog
test
CS121 © JAS 2004
CS121 Data Structures
Operations on a table
Initialisation
• Chaining – set chain pointers to nil – O(n)
• Open addressing – set keys to empty key – O(n)
Insert, Retrieval, Update
• Chaining – use hash function, then search linked list – O(s)
where s is average list length
• Open addressing – use hash functions to find entry
– O(k) where k is some constant based on complexity of
hash functions
CS121 © JAS 2004
CS121 Data Structures
Deletion
• Chaining – delete entry from list – O(s)
• Open addressing – must mark entry has having been
deleted to avoid stopping search early – O(k)
Enumeration
• Not normally feasible
CS121 © JAS 2004
CS121 Data Structures
Design of Hash functions – step one generate a key value –
preconditioning
(1) Using codes for letters –
a=1
ascii
posn
cat 3+1+20=24 99+97+116=312
3*1282+2*1281+20*1280
=49428
tac 20+1+3=24 116+97+99=312
20*1282+2*1281+3*1280
=327939
CS121 © JAS 2004
CS121 Data Structures
(2) Using bit patterns
cat
tac
1100011 1100001 1110100 = 1634548
1110100 1100001 1100011 = 1913059
If the bit patterns get too long then a section of the bit pattern can
be selected to use for the hash function.
CS121 © JAS 2004
CS121 Data Structures
Step two – generate an address from the key value
• Division by table size
• Folding – The key is divided into sections, and the sections
are added together.
– if Key = 013402122 ; hash(Key) = 013+402+122 = 537
– addition, subtraction and multiplication could be used
• Midsquare – square the key and select a number of bits from
the middle
– if Key = 12345; Key2 = 152399025; select middle – 399
• Truncate – select last n digits
– if Key = 13402122; select last 3 digits – 122
CS121 © JAS 2004
CS121 Data Structures
Hash Table Reordering
When a hash table is nearly full many items are not at the
locations given by their hash address so many comparisons
are made to locate them, and if an item is not present an
entire list of rehash positions must be searched.
To address this problem the table can be reordered.
CS121 © JAS 2004
CS121 Data Structures
Ordered Hash Tables (Amble and Knuth)
All items that hash to the same address are maintained in
descending order of the key. Assume that a nil entry is less
than all possible keys.
A search can then terminate whenever a key less than that
being searched for is found.
On insertion when a collision occurs the keys are compared
and the rehash/search for free space process continues with
the smaller key.
CS121 © JAS 2004
CS121 Data Structures
0
1
2
3
H(cat)=(3+1+20)/9=24/9=6; rh=2
H(act)=(1+3+20)/9=24/9=6; rh=2
H(tac)=(20+1+3)/9=24/9=6; rh=2
4
5
6
cat act
tac
7
8
CS121 © JAS 2004
CS121 Data Structures
Ordered Hash Tables reduce the average number of probes to
determine an entry is in the table.
However, the number of probes for insertion is not reduced
and equals the number required for an unsuccessful search
in an unordered table.
Also ordered tables require significant data movement.
CS121 © JAS 2004
CS121 Data Structures
Brent’s Method
Works with double hashing.
Rehash search key until empty slot found.
Then consider all keys in rehash path and determine if placing
one of these in an empty slot would require fewer
rehashes.
If so place search key in this position and move existing key
to empty rehash slot.
CS121 © JAS 2004
CS121 Data Structures
0
1
2
3
4
5
6
7
8
CS121 © JAS 2004
CS121 Data Structures
Brent’s method reduces the average number of probes for a
successful search, but not for an unsuccessful search.
Insertion is also more complex.
Brent’s method can be applied recursively – the displaced
item is again inserted using Brent’s method (checking keys
on rehash path for possible replacement – but if carried to
extreme would yield unacceptably long insertion times.
CS121 © JAS 2004
CS121 Data Structures
Consider a hash function which used the initial letter of the
key word and inserted into a table of size 26
If collisions are resolved by chaining we have an index table
– indexed by initial letter
If the chains are replaced by the same hash structure using the
second letter of the key word as the hash function
- we have a 26-ary tree
CS121 © JAS 2004
CS121 Data Structures
With an index table (or a 26-ary tree!) we regain the property
that we can list the entries in sorted form
However, it is also possible to combine hash tables, trees and
index tables
EgUse an index table according to first letter, then use hash
tables for each letter sub-table
CS121 © JAS 2004
CS121 Data Structures
CS121 © JAS 2004