DSLec(Hashing). - CSE246DataStructures

Download Report

Transcript DSLec(Hashing). - CSE246DataStructures

Data Structures and Algorithms
Lecture (Searching)
Instructor: Quratulain
Date: 4 and 8 December, 2009
Faculty of Computer Science, IBA
Basic search terminologies





A table or file is a group of elements, called a
record.
Each record is associated with key to
differentiate the record.
Key could be internal or embedded with table. In
other case could be external (outside of table).
For every record, the unique key is called a
primary key.
A table of records in which a key is used for
retrieval is often called a search table or
dictionary.
Basic search terminologies
A table is organized with specific search
technique.
 The table may be contained completely in
memory, completely in auxiliary storage
or it may be divided between the two.
 In this course we will concentrate on
internal search (completely in memory).

Dictionary as an Abstract Data Type
A dictionary is an ordered or unordered list of keyelement pairs, where keys are used to locate elements in
the list.
Example:
◦ consider a data structure that stores Student
information; it can be viewed as a dictionary, where
Student ID serve as keys for identification of Student
objects.
 Operation on Dictionary
◦ Size(), empty(), finditem(key), removeitem(),
insertitem(key, element)
 Note Java has a built-in abstract class java.util.Dictionary

Search Techniques for Dictionary ADT





Sequential search
Indexed sequential search (book handout)
Binary Search
Hashing
Trees searching
◦
◦
◦
◦
◦
Binary search trees
Multiway search trees
B-trees
AVL trees
Red black trees
Sequential Search
Simplest form of search algorithm for
dictionary, implemented either in array or
linklist.
 The algorithm examine each key; upon
finding one that matches the search
argument, its index or value is returned.
 In worst case, number of comparison are
O(n).

Indexed Sequential Search
This technique increase the search
efficiency for a sorted file but need more
space to store indexes.
 These indexes can be applicable to both
arrays and linklist.

Binary Search
Most efficient search technique.
 Use sorted array.
 Cannot implement using linklist
 The worst case complexity is O(logn)

Hashing

Hashing is a method for directly referencing an
element in a table by performing arithmetic
transformations on keys into table addresses. This
is carried out in two steps:
Step 1: Computing the so-called hash function
H: K -> A.
Step 2: Collision resolution, which handles cases
where two or more different keys hash to the
same table address.
Hashing





Hashing is function that maps each key to a
location in memory.
Minimize the number of comparison to insert a
data.
A key’s location does not depend on other
elements, and does not change after insertion.
 unlike a sorted list
A good hash function should be easy to compute,
minimize hash collision, and uniformly distributed
records in table.
With such a hash function, the dictionary
operations can be implemented in O(1) time.
Hashing





Suppose the company has100 inventory record then two
digit indices of array will be sufficient.
But if the 100 inventory with 5-digit of code then we have
to use 5-digit indices of array but store only 100 records
will waste the memory.
Map key values to hash table addresses
keys -> hash table address
This applies to find, insert, and remove
Usually: integers -> {0, 1, 2, …, Hsize-1}
Typical example: f(n) = n mod Hsize
Non-numeric keys converted to numbers
◦ For example, strings converted to numbers as
 Sum of ASCII values
 First three characters
Hash collision
When h(k1)=h(k2) then two values
cannot be place on one location this is
called Hash Collision or hash clash
 Two method to deal with clashes.

◦ Rehashing (use secondary hash function that contunue until new
location found)
◦ Chaining ( build a link list of all items with
same hash key)
Resolve hash clashes using Open
Address(Rehashing)
The simplest method of resolving clashes is to
place the record in the next available position in
the array which is still open. This technique of
rehashing is called linear probing.
 Rehash function

◦ rh=(h(k) +1 )%m [cover all indices]
◦ rh=(h(k) +2 )%m [cover only even indices]

Loop executes forever to find location if:
◦ Hash table is full, therefore maintain the count.
Resolve hash clashes using Open
0
Address
1
0
1
2
2
A
3
3
I
4
4
J
5
5
6
6
C
7
7
B
8
8
D
9
9
F
6
10
10
K
G 43 7
11
11
H 67 16
12
12
I
88 3
13
13
J
36 2
14
14
K 40 6
15
15
16
16
k
H(k)
A 53 2
H(k) = k mod 17
B 41 7
C 91 6
D 75 7
E 13 13
F
6
Index between
0 and 16
E
H
Hashing
Good hash function is one minimize the
collision
 Hashing allow direct access therefore
preferable to other search technique.
 It is more efficient to initialized hash table
with extra spaces to reduce rehash many
times.

Resolve hash clashes by open
addressing
Store all elements in table
 If a cell is occupied, try another cell.
 Linear probing, try cells

◦ H(k), H(k) + 1 mod m, H(k) + 2 mod m, . .

Rh(i)=(i+2)%17 used to rehash function. Then any
key that hashes into an even integer rehashes into
successive even integers even then list is empty in
odd locations and same with odd integers.
Primary Clustering
The phenomena, where two keys
that hash into different values
compete with each other in
successive rehashes, is called
primary clustering.
Example:
any record whose new coming key
hashes into 4,5,6,7 will be placed
in 7, where as only record whose
key hashes into 7 will be placed in
that location.

0
4700
1
2
3
4
114
5
665
6
346
7
8
508
9
109
Primary Clustering





To eliminate the primary clustering is to allow
the rehash to depend on the number of time
hash function is applied:
rh(i,j), rehash value I if the key is being rehashed
for the j time.
E.g. rh(i,j)=(i+j)%tablesize
Another approach is to use random number
between tablesize to ensure that no two
rehashes for same key conflict.
However these approaches eliminate primary
clustering but not eliminate secondary
clustering.
Secondary clustering
The secondary clustering is a phenomena
in which different keys that hash to the
same value follow the same rehash path.
 To eliminate all clustering, use another
phenomena called double hashing.

Double hashing





Involve the use of two hash function
h1(key) and h2(key)
h1 is the primary key function then If the
position is occupied then
rh(i,key)=(i+h2(key))%tablesize until empty loc
is found.
The rehash function use original key with hash
value
As long as h2(key1) does not equal h2(key2),
records do not compete for the same set of
locations.
Analysis
The efficiency of hashing method is
measured by the average number of table
position that must be examined in
searching for a particular item.
 Let n be the number of item currently in
hash table and table size be the number
of positions in table then for large table it
has been proved that average number of
probes required for successful search.

Analysis
The load factor x is defined as the ratio between
occupied slots and table size.
x=(n-1)/tablesize
 If the fraction of location that is occupied is equal
to the load factor x, we can approximate the
number of probe (checks) for successful search
under linear rehashing.
 When the load factor is reaches to the tablesize
(table is 80% full) then create a hashtable of
double size to minimize rehashing.

Analysis



Worst-case: All keys hash to the same bucket. Insert takes
O(1), but delete and find take O(N).
In linear probing max Load factor <=1
Performance of the hash tables, based on open addressing
scheme is very sensitive to the table's load factor. If load
factor exceeds 0.7 threshold, table's speed drastically
degrades. Indeed, length of probe sequence is proportional
to (loadFactor) / (1 - loadFactor) value. In extreme case,
when loadFactor approaches 1, length of the sequence
approaches infinity. In practice it means, that there are no
more free slots in the table and algorithm will never find
place to insert a new element. Hence, this kind of hash
tables should support dynamic resizing in order to be
efficient.
Chaining
One simple scheme is to chain all
collisions in lists attached to the
appropriate slot.
 This allows an unlimited number of
collisions to be handled and doesn't
require a priori knowledge of how
many elements are contained in the
collection.
 The trade off is the same as with
linked lists versus array
implementations of collections

Chaining
In worst case, all keys map to same index.
 Good hash function should uniformly
distributes keys in all indices.
 Load factor x=n/tablesize. Where n is
number of slots occupied.
 If x is load factor then O(1+x), where x
proportion of comparison need to search
in linklist or array list which is refer by an
index.

Analysis
In chaining load factor n/m
 Worst-case: All keys hash to the same
bucket. Insert takes O(1), but delete and
find take O(N).
 Chaining Average Case:

◦ If keys are uniformly distributed, then all
buckets are about the same size.
◦ Each linked list has expected size O(N/M).
Thus, insert is still O(1), but delete and find
become O(N/M). N/M is called the load factor.
Chaining Vs Open addressing
Chaining
Open addressing
Collision resolution
Using external data structure
Using hash table itself
Memory waste
Pointer size overhead per entry
(storing list heads in the table)
No overhead 1
Performance dependence on
Directly proportional
table's load factor
Proportional to (loadFactor) / (1
- loadFactor)
Allow to store more items,
than hash table size
No. Moreover, it's
recommended to keep table's
load factor below 0.7
Yes
Hash function requirements Uniform distribution
Uniform distribution, should
avoid clustering
Handle removals
Removals are ok
Removals clog the hash table
with "DELETED" entries
Implementation
Simple
Correct implementation of open
addressing based hash table is
quite tricky
Hash Maps
A map data structure stores (key, value) pairs.
Values are retrieved by searching for the
appropriate key. Maps are also sometimes called
dictionaries or associative arrays.
 Maps can be implemented in a number of ways.
One of the most common ways is to store the
(key, value) pairs in a hash table. This is called a
hash map.
