File Structures

Transcript File Structures

Pengindeksan
Dan
Fail Songsang
(inverted File)
Penjanaan Fail Indeks Songsang
Word Extraction
Word IDs
Dokumen Asal
W1:d1,d2,d3
W2:d2,d4,d7,d9
Document IDs
Wn :di,…dn
Inverted Files
Indeks Songsang


Sistem capaian maklumat membangunkan indeks songsang untuk
mencari katakunci dalam koleksi dokumen dengan berkesan.
Indeks songsang mengandungi dua komponen iaitu satu senarai bagi
setiap katakunci yang dipanggil indeks dan satu senarai yang dipanggil
posting list.
Posting List panjang
Posting List pendek
Terbaik
disimpan
utama
jika
dalam
indeks
ingatan
Disebabkan saiznya posting list
disiimpan dalam disk
Penjanaan Fail Indeks Songsang


Map the file names to file IDs
Consider the following Original Documents
D1
The Department of Computer Science was established in 1984.
D2
The Department launched its first BSc(Hons) in Computer Studies in
1987.
D3
followed by the MSc in Computer Science which was started in 1991.
D4
The Department also produced its first PhD graduate in 1994.
D5
Our staff have contributed intellectually and professionally to the
advancements in these fields.
Penjanaan Fail Indeks Songsang
green: stop word
D1
The Department of Computer Science was established in 1984.
D2
The Department launched its first BSc(Hons) in Computer Studies
in 1987.
D3
followed by the MSc in Computer Science which was started in
1991.
D4
The Department also produced its first PhD graduate in 1994.
D5
Our staff have contributed intellectually and professionally to the
advancements in these fields.
Penjanaan Fail Indeks Songsang
After stemming, make lowercase (option), delete numbers (option)
D1
depart comput scienc establish
D2
depart launch bsc hons comput studi
D3
follow msc comput scienc start
D4
depart produc phd graduat
D5
staff contribut intellectu profession advanc field
Penjanaan Fail Indeks Songsang
(belum terisih)
Words
Documents
Words
Documents
depart
d1,d2,d4
produc
d4
comput
d1,d2,d3
phd
d4
scienc
d1,d3
graduat
d4
establish
d1
staff
d5
launch
d2
contribut
d5
bsc
d2
intellectu
d5
hons
d2
profession
d5
studi
d2
advanc
d5
follow
d3
field
d5
msc
d3
start
d3
Penjanaan Fail Indeks Songsang
(terisih)
Words
Documents
Words
Documents
advanc
d5
msc
d3
bsc
d2
phd
d4
comput
d1,d2,d3
produc
d4
contribut
d5
profession
d5
depart
d1,d2,d4
scienc
d1,d3
establish
d1
staff
d5
field
d5
start
d3
follow
d3
studi
d2
graduat
d4
intellectu
d5
launch
d2
Pembinaan indeks

Setiap dokumen diwakilkan dalam bentuk vektor
• <term1, term2, term3, …, termn>
• Setiap kemasukkan data menggambarkan bilangan sesuatu
term itu ujud pada satu-satu dokumen
Document ids
nova
10
A
5
B
C
D
E
F
G 5
H
I
galaxy heat
5
3
10
7
6
10
h’wood
film
role
10
9
8
10
7
5
2
7
9
8
5
diet
fur
10
9
10
10
1
3
Terms
Indeks Songsang


Secara konsep, ianya telah dipelajari dalam model ruang vektor
dimana ianya dijanakan dalam bentuk vektor di antara term vs
dokumen.
Fail songsang merupakan “songsangan” dari fail vektor dimana
lajur menjadi baris dan baris menjadi lajur.
docs
D1
D2
D3
D4
D5
D6
D7
D8
D9
D10
t1
1
1
0
1
1
1
0
0
0
0
t2
0
0
1
0
1
1
1
1
0
1
t3
1
0
1
0
1
0
0
0
1
1
Terms D1
t1
t2
t3
D2
1
0
1
D3
1
0
0
D4
0
1
1
D5
1
0
0
D6
1
1
1
…
D7
1
1
0
0
1
0
Pembinaan Fail Songsang

Dokumen dihuraikan bagi menghasilkan token dan ia
disimpan bersama dengan ID dokumen
Doc 1
Doc 2
Now is the time
for all good men
to come to the aid
of their country
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Term
now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
it
was
a
dark
and
stormy
night
in
the
country
manor
the
time
was
past
midnight
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
Pembinaan Fail
Songsang

Setelah selesai semua dokumen
dihuraikan, maka fail songsang
diisih dalam bentuk tersusun.
Term
now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
it
was
a
dark
and
stormy
night
in
the
country
manor
the
time
was
past
midnight
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
the
the
their
time
time
to
to
was
was
Doc #
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
1
2
2
1
1
2
1
1
2
2
Pembinaan Fail
Songsang

Term yang berulang pada
sesuatu dokumen akan
dicantumkan (tambah nilai
kekerapan)
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
the
the
their
time
time
to
to
was
was
Doc #
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
1
2
2
1
1
2
1
1
2
2
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
their
time
time
to
was
Doc #
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
Pembinaan Fail Songsang

Kemudian fail boleh dipecahkan kepada dua iaitu
• Fail Dictionary
dan
• Fail Postings
Pembinaan Fail Songsang
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
their
time
time
to
was
Doc #
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
Dictionary
Term
a
aid
all
and
come
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
their
time
to
was
N docs
Postings
Doc #
Tot Freq
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
2
2
2
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
Fail Songsang
Kelebihan
 Meningkatkan keberkesanan penggelintaran.
Kelemahan
 Keperluan menyimpan struktur data yang saiznya 10 – 100%
lebih besar daripada saiz teks dan keperluan untuk menukar
indeks jika terdapat penukaran data.
 Proses pengemaskinian indeks adalah mahal tetapi
tatasusunan yang tersisih mudah dijanakan dan cepat.
Struktur Data yang digunakan
pada Fail Songsang




Tatasusunan Terisih (Sorted Arrays)
Pohon B
Struktur Cincangan (Hashing Structures)
Tries (digital search trees)
Tatasusunan Terisih




Fail songsang yang menggunakan metod ini
menyimpan katakunci dalam bentuk tatasusunan
terisih, berserta dengan bilangan dokumen yang
mengandungi katakunci tersebut dan hubungan yang
menghubungkan ke dokumen-dokumen tersebut.
Penggelintaran dalam tatasusunan ini ialah
berdasarkan penggelintaran binari.
Kebaikan : senang nak diimplementasi
Keburukan : pengemaskinian indeks agak mahal
Tatasusunan Terisih
Penghasilan tatasusunan fail songsang terisih boleh dibahagi kepada 2
atau 3 langkah:
1.
Teks yang digunakan sebagai input dihuraikan menjadi senarai
perkataan-perkataan berserta dengan lokasinya dalam teks
(tentukan penggunaan katahenti dan cantasan sama ada perlu
dimasukkan atau tidak. Ini bergantung kepada kekangan
penggunaan masa dan storan dalam operasi pengindeksan).
2.
Senarai perkataan di songsangkan dari senarai perkataan dalam
susunan lokasi ke senarai perkataan terisih bagi kegunaan carian.
Pengisihan dibuat dalam susunan tertentu beserta semua lokasi
yang dikaitkan bagi setiap term/perkataan.
Tatasusunan Terisih
3.
Proses lanjutan terhadap fail songsang yang terhasil seperti
meletakkan pemberat sebutan atau penyusunan semula atau
penggunaan pemadatan (compression) bagi fail. (proses ini adalah
opsional)
Pohon B
Pohon-B biasanya digunakan untuk tujuan gelintaran data. Ia mesti
mempunyai nombor kunci dan anak. Pohon pada order m nerupakan
pohon dimana setiap nod mempunyai sebanyak-banyaknya m anak.
Bagi setiap nod, jika k merupakan bilangan sebenar anak pada nod,
maka k-1 merupakan bilangan kunci pada nod
Rujuk rajah dibawah dimana baris pertama menunjukkan nod bagi
setiap kunci manakala baris kedua menunjukkan penunjuk ke kunci
anak.
Pohon B
Jika pohon gelintar dalam order 4 maka ia harus memenuhi syarat
berikut
 The keys in each node are in ascending order.
 Bagi setiap nod jika berikut adalah benar.
• Sub pokok bermula dari rekod Node.Branch[0] hanya ada kunci
yang kurang dari Node.key[0]
• Sub pokok bermula dari rekod Node.Branch[1] hanya ada kunci
yang lebih dari Node.key[0] dan pada masa yang sama kurang
dari Node.Key[1]
• Sub pokok bermula dari rekod Node.Branch[2] hanya ada kunci
yang lebih dari Node.key[1] dan pada masa yang sama kurang
dari Node.Key[2]
• Sub pokok bermula dari rekod Node.Branch[3] hanya ada kunci
yang lebih dari Node.key[2]

Pohon B

Berikut merupakan contoh bagi pohon-B dengan order 5. Ini bermaksud
semua nod luar mempunyai sekurang-kurangnya ceil(5/2) = 3 anak.
Bilangan maksimum anak bagi nod adalah 5 (4 adalah bilangan
maksimum kunci). Setiap nod daun mesti mengandungi sekurangkurangnya 2 kunci.
Pohon B (Kemasukkan Data Baru)




Katakan kemasukkan data baru
akan dibuat ke atas pohon-B yang
kosong di mana ia menggunakan
order 5.
Diberi huruf-huruf berikut : C N G A
H E K Q M F W L T Z D P R X Y S.
Ini bermaksud nod boleh
mempunyai maksima 5 anak dan 4
kunci. Semua nod selain akar mesti
mempunyai minimum 2 kunci.
4 huruf dimasukkan pada nod
seperti rajah disebelah
Pohon B (Kemasukkan Data Baru)
Masukkan H,
Masukkan E, K, dan Q
Masukkan M
Pohon B (Kemasukkan Data Baru)
Huruf F, W, L, dan T
masuk Z
Pohon B (Kemasukkan Data Baru)
Masukkan D
Masuk S
Pohon B (Penghapusan Data)
Penghapusan huruf H
Pohon B (Penghapusan Data)
Hapuskan huruf T.
Cincangan
Apa itu Cincangan ?
 Teknik untuk menentukan indeks atau lokasi untuk
menyimpan data pada struktur data.
 Fungsi cincangan :
• Untuk menghantar kunci carian/gelintar.
• Merupakan satu transformasi kepada bentuk kunci
• Kebiasaannya dalam bentuk formula matematik
• Memulangkan indeks dimana akan disimpan dan
untuk capaian data pada jadual.
Konsep Asas
We can think of
hashing as a keyto-address
transformation 
the keys map to
addresses in a list.
Cincangan



Fungsi cincang ialah fungsi h(k) yang menukarkan data
kepada kunci iaitu suatu alamat bagi suatu julat 0 
SaizJadual-1
Fungsi cincang digunakan untuk memetakan kekunci ke
dalam slot di dalam Jadual cincangan.
Contoh :
•
Katakan kita menentukan untuk menggunakan 1000 alamat
maka jika U merupakan semua kemungkinan set kekunci,
maka fungsi hash adalah dari U ke {0, 1, 2, …..999}
k
Kod ASCII untuk
2 huruf pertama
Hasil darab
(d)
h(k)= d mod
1000
BALL
66, 65
66.65 = 4290
290
LOWELL
76, 79
76.79 = 6004
004
TREE
84, 82
84.82 = 6888
888
000
001
..
004 LOWELL
..
290 BALL
…
888 TREE
…
999
Contoh
Hash(const char *Key, const int TableSize)
{
int HashVal = 0;
while (*key != ‘\0’)
HashVal += *key++;
return HashVal % TableSize
}
U
T
(universe of keys)
k
K 1 k
k5
4
(actual k
7
keys)
k6
k2
k8 k3
——
——
——
——
——
——
Fungsi cincangan yang baik
for (hash=len; len--;)
{
hash = ((hash<<5)^(hash>>27))^*key++;
}
hash = hash % prime;
Cincangan
Namun begitu, terdapat kekunci yang berbeza tetapi dihantar alamat yang
sama maka akan berlaku perlanggaran (collision)
Seperti contoh sebelum, di mana terdapat dua atau lebih yang bermula
dengan 2 huruf pertama yang sama.
Maka satu proses yang dinamakan cincangan semula (rehashing) perlu
dilakukan
T
U
(universe of keys)
k
K 1 k
k5
4
(actual k
7
keys)
k6
k2
k8 k3
——
——
——
——
——
——
Cincangan Semula (Rehashing)
Fungsi Cincangan semula
Fungsi kedua yang boleh digunakan untuk memilih lokasi jadual
bagi item baru yang akan dimasukkan. Jika lokasi tersebut juga
telah digunakan maka fungsi rehash boleh digunakan bagi
mendapat lokasi ketiga dan seterusnya.
Contoh mudah fungsi rehash :
rehash(k) = (k + 1) % prime
Cincangan Semula (Rehashing)
Kaedah untuk mengurangkan perlanggaran
 Cuba dapatkan fungsi cincangan yang terbaik untuk penaburan
rekod
 Penggunakan ruang ingatan yang lebih besar. Meningkatkan
ruang pengalamatan, contohnya jika keperluan ialah 1000
maka lebihkan sehingga 2000 ruang tambahan.
 Letakkan lebih dari satu rekod pada satu alamat (penggunaan
buckets)
Rantaian (Chaining)

Chaining puts elements that hash to the same slot in
a linked list:
T
U
(universe of keys)
k1
K
k4
k5
(actual
k7
keys)
k6
k2
k8 k3
——
——
——
——
——
——
k1
k4 ——
k5
k2
k3 ——
k8
k6 ——
k7 ——
Hashing (Abu Ata)




Memudahkan sesuatu alamat disimpan dan dicapai secara terus
serta cepat dan betul.
Dikira berdasarkan Kod ASCII bagi sesuatu huruf dan dijadi
penghubung antara huruf-huruf perkataan yang diindeks
Hubungan dikira berdasarkan susunan huruf antara set huruf dan
untuk huruf berikutnya berdasarkan susunan huruf yang
bersebelahan.
Mungkin berlaku perlanggaran. Cincangan semula dilakukan dan
satu alamat baru akan dijanakan bagi mendapatkan satu rekod
yang kosong.
Hashing (Abu Ata)
1.
2.
3.
4.
Semua huruf ditukar kepada huruf kecil
Set 26 huruf abjad diberi nilai berdasarkan susunan jujukan
dalam set abjad contoh : a=1, b=2, c=3 ……..,y=25,z=26
Huruf bagi suatu perkataan dan huruf yang berikutnya dan
pengiraan adalah seperti berikut
i. Kedudukan huruf pertama dalam set abjad (peraturan 2)
ii. Kedudukan huruf kedua dalam set abjad (peraturan 2)
iii. Keputusan pada (i) di darab dengan 26
iv. Campur keputusan pada (ii) dan (iii)
Campur keputusan bagi peraturan di 2 dengan peraturan di 3
Basic Concepts


In this case, we must use the collision
resolution algorithm to determine the
next possible location for the data and
continue until we find the correct data.
Each calculation of an address and test
for success is known as a probe.
Sumber : http://www.ee.udel.edu/~durbano/teaching/CISC220/slides/38
Hashing Methods

There are several hashing methods that we will
discuss:
•
•
•
•
•
•
•
•
Direct
Subtraction
Modulo-Division
Digit Extraction
Midsquare
Folding
Rotation
Pseudorandom Generation
Direct Method

In direct hashing, the key is the address
without any algorithmic manipulation.
Direct Method


In this case, the hash table must contain
an element for every possible key.
Although it has a limited use, it is
powerful in the sense that it is easy to
code and there are no synonyms.
Direct Method



As an example, consider a small
company with less than 100 employees.
Each employee is assigned an employee
number (from 0 to 99).
By storing the employees in an array of
size 100, we can reference an employee
simply by using the employee number as
the index into the array.
Direct Method



Obviously, the direct method has limited uses.
Namely, it can only be used on small data sets.
For example, it would be impractical to use
direct hashing via the SSN of our employees.
If we did, we would have a 9 digit number as
the index into our array (i.e., we would need an
array of size 1 billion but would use less than
100 entries!)
Subtraction Method



Sometimes, keys may be consecutive,
but may not start from ‘1’.
Consider our small company – what if
we assigned employee numbers from
1000 to 1099?
In this case, our hashing function would
simply subtract 1000 from the key value
to produce the address (0 to 99).
Subtraction Method

Algorithm:
address = key – subtractionConstant

As with the direct method, the
subtraction method is easy to implement,
guarantees no collisions, and has limited
uses.
Modulo-Division Method


Also known as the division remainder
method, the modulo-division method
divides the key by the array size and
uses the remainder for the address.
Algorithm
address = key % listSize
Modulo-Division Method


Although this algorithm will work with any
size list, we typically choose a list size
that is a prime number.
This has the effect of reducing the
number of collisions.
Modulo-Division Method


To continue with our small company
example, let’s say we are planning on
expanding our company.
In our new system, employees will
receive employee numbers from 0 to
999,999 and we will provide space in our
data structure for up to 300 employees.
Modulo-Division Method



We start by choosing a list size of 307
(the first prime number above 300).
Therefore, our available address space
is 0 to 306 (key%307=[0,306]).
As an example, let’s say we want to
hash Bryan’s employee number
121267:
121267/307 = 395 remainder 2
Therefore, hash(121267)=2
Modulo-Division Method
Modulo-Division Method

Note: in a test situation, I expect you to
be able to perform the modulus
operation on small numbers.
Digit-Extraction Method


Using digit extraction, selected digits are
extracted from the key and used as an address.
For example, using our 6-digit employee
number from before, if we wanted to realize a
3-digit address, we could select the 1st, 2nd, and
last digits to create our address
379452  372
121267  127
Midsquare Method


In midsquare hashing, the key is
squared and the address is selected from
the middle of the result.
For example, if our key value were 9452:
9452*9452=89340304
address = 3403
Midsquare Method


A limitation to the use of this method is
the size of key.
Because squaring a key produces a
number twice the length of the key, this
method will only work for small key
values.
Midsquare Method


However, if we wish to apply the
midsquare method to large key values,
we can simply choose a subset of the
digits of the key to square (sort of like
digit extraction).
address
For example:
379452  379*379 = 143641 = 364
key
Folding Methods

Two folding methods are used:
• Fold shift
• Fold boundary
Fold Shift



In fold shift, the key value is divided into
parts whose size matches the size of the
required address.
Then, the left and right parts are shifted
and added with the middle part.
Should the addition result in a carry digit,
that digit is simply dropped.
Fold Boundary



In fold boundary, the left and right
numbers are folded on a fixed boundary
between them and the center number.
The two outside values are thus
reversed.
Should the addition result in a carry digit,
that digit is simply dropped.
Folding Methods
Rotation Method




In the rotation method, we rotate a digit to the
front or back of the key.
This has the effect of spreading the keys more
evenly over the key space.
Rotation hashing is usually used in conjunction
with other hashing methods, which results in a
more effective hash.
Example: imagine selecting only the first 3
digits of the following keys.
Rotation Method
Pseudorandom Method


Here, we use the key as the seed in a
pseudorandom number generator.
The resulting random number is then
scaled into the possible address range
using modulo division.
Pseudorandom Method


One example of a random number generator is
y = ax + c
where
x = key
a = scaling coefficient
c = constant
address = y%listSize
For maximum efficiency, a and c should be
prime numbers.
Collision Resolution




With the exception of direct hashing and
subtraction hashing, none of the hashing
methods we discussed result in a one-to-one
mapping.
Therefore, as discussed before, a collision
may occur.
Fortunately, there are many methods of
dealing with collisions (all of which are
independent of the hashing method used).
That is, any collision resolution algorithm can
be used with any hashing algorithm.
Collision Resolution

Generally, there are two different
approaches to resolving collisions:
• Open addressing
• Linked Lists

A third concept, buckets, defers
collisions, but does not fully resolve
them.
Open Addressing


Open addressing resolves collisions in
the prime area (the area that contains all
of the home addresses).
This technique is contrasted with linked
list resolution, in which the collisions are
resolved by placing the data in a
separate overflow area.
Open Addressing


When a collision occurs, the prime area
addresses are searched for an open element
where the new data can be placed.
We will discuss 4 methods of open addressing:
•
•
•
•
Linear probe
Quadratic probe
Double hashing
Key offset
Linear Probe
When data cannot be
stored at the home
address, we resolve the
collision by adding 1 to the
current address.
Here, we get a collision at
address 1. To resolve this,
we try to insert the data at
2. However, this location
is occupied, so we try
address 3.
Linear Probe



As an alternative to a simple linear probe, we
can add 1, subtract 2, add 3, subtract 4, etc.
until we locate an empty element.
Note: the code that does the collision
resolution must verify that the new address is
within the address space.
For example, if we are at the last element of
the list, when we add 1, we must start back at
the beginning of the list.
Linear Probe

Linear probes have 2 advantages:
• Easy to implement
• Data tends to remain near their home
addresses (good for caching)

However, linear probes tend to produce
primary clustering.
Linear Probe


After the collision has been resolved,
hashing continues as it did before the
collision.
The next time a collision occurs, we restart our resolution algorithm by adding 1
to the address and then continue as
before.
Quadratic Probe




We can eliminate the primary clustering
phenomenon in the linear probe by adding a
number other than 1 to the address.
One example of this is the quadratic probe.
Here, the increment is the collision probe
number squared.
Thus, for the first collision we add 12, the
second collision we add 22, the third collision
32, etc.
Quadratic Probe


Again, we have to make sure that we
don’t run off the end of the address list.
To do this, we use the modulus of the
new address and the list size.
new address =
(last address tried + probe2)%listSize
Quadratic Probe



Disadvantages:
•
•
•
The time it takes to perform the ‘square’ operation
Produces secondary clustering
It is not possible to generate a new address for every
element in the list
To help alleviate the last disadvantage, we
choose a list size that is a prime number.
This will allow at least half of the list to be
reachable (a reasonable number).
Quadratic Probe


After the collision has been resolved,
hashing continues as it did before the
collision.
The next time a collision occurs, we start
resolution again with 12 and continue as
before.
Double Hashing



The next two open addressing methods
are collectively known as double
hashing.
In double hashing, rather than use an
arithmetic probe function, as in the linear
and quadratic probes, we rehash the
address.
This prevents primary clustering.
Double Hashing




The probe sequences used by both linear and
quadratic probing are key independent.
For example, linear probing inspects the table
locations sequentially, no matter what the
value of the key is.
In contrast, double hashing defines keydependent probe sequences.
In this scheme, the probe sequence still
searches the table in a linear order, but a
second hash determines the size of the steps
taken.
Pseudorandom Collision
Resolution



The first method uses a pseudorandom
number to resolve the collision.
This is basically the same process as the
pseudorandom hashing function.
In this case, however, instead of using
the key as the seed to the
pseudorandom number generator, we
use the collision address as the seed.
Pseudorandom Collision
Resolution
Here, we have a
collision at address 1.
To resolve this collision,
we use the collision
address (1) in our
pseudorandom number
generator
y=3(1)+5=8
Therefore, we try
address 8 as our new
address.
Pseudorandom Collision
Resolution


Disadvantage: all of the keys will follow
only 1 collision resolution path through
the list.
Therefore, this method will lead to
secondary clustering.
Key Offset



The second double hashing method is
key offset.
This method will produce different
collision paths for different keys.
Whereas the pseudorandom number
generator produces a new address as a
function of only the collision address, the
key offset method uses both the original
key and the collision address to calculate
the new address.
Key Offset

Here is one of the simplest
implementations:
offset = key/listSize; // integer arithmetic
address = (old address + offset)%listSize;


Here, we calculate an offset value based on the
key and add this value to the collision address.
Does this method lead to primary or secondary
clustering?
Linked List Resolution



In open addressing, we resolve collisions by
placing the data in the same memory area as
the rest of the data (the prime area).
One problem with this approach is that each
resolved collision increases the probability of
future collisions.
This disadvantage can be eliminated by using a
linked list resolution approach rather than an
open addressing approach.
Linked List Resolution


Linked list resolution uses a separate
area (the overflow area) to store the
collisions and chains all of the synonyms
together in a linked list.
When a collision occurs, one element is
stored in the prime area and the other
element is stored in the overflow area.
Linked List Resolution
Here, we have had 2
collisions at address 1. The
collision is resolved by
placing the synonyms in a
linked list with the head
element in the prime area.
Linked List Resolution


Items are usually inserted in a last-in, first-out
(LIFO) order. This allows for fast insertions as
the list need not be scanned … the element is
simply inserted in the prime area.
Another possible ordering is a key sequenced
list where the data with the smallest key value
is stored in the prime area, allowing for fast
retrievals.
Bucket Hashing



Another approach to handling the
collision problem is bucket hashing.
A bucket is a node that can
accommodate multiple data
occurrences.
Because the bucket can hold multiple
data values, collisions can be postponed
until the bucket is full.
Bucket Hashing
Here, we see an
implementation with a
bucket size of 3. This
structure can accommodate
up to 3 synonyms before a
collision will occur.
Bucket Hashing

Disadvantages:
• More space is used (many buckets will be
•


empty or partially empty)
It does not completely resolve collisions
When a collision does occur, a typical
resolution is to use a linear probe.
Here, we assume that the adjacent
bucket will have an empty space.
Bucket Hashing


Question: Why not just increase the
size of the hash table instead of using
buckets?
Answer: Entire bucket will probably be
cached (its contents are adjacent in
memory). Thus, multiple “probes” will
likely “hit” in cache.
Combination Approaches


Typically, we often use multiple steps to
resolve collisions.
For example, we might use bucket
hashing. Should a collision occur, we
will perform up to, say, 3 linear probes to
try to resolve the collision. Then, we
may resort to a linked list resolution.
What Makes A Good Hashing
Function?
1)
2)
A hashing function should be fast and
easy to compute.
A hashing function should scatter the
data evenly throughout the hash table.
•
How well does the the hash function scatter
random data? Nonrandom data?
Tips For Developing Good
Hashing Functions


The calculation of the hashing function
should involve the entire search key.
Thus, for example, computing the
modulus of the entire ID number is much
safer than using only its first 2 digits.
Tips For Developing Good
Hashing Functions



If a hashing function uses modulo
arithmetic, the base should be a prime
number.
That is, if h is of the form:
h(x) = x mod listSize
then listSize should be a prime number.
This is a safeguard against many subtle
kinds of patterns in the data (for
example, search keys whose digits are
likely to be multiples of one another).
Disadvantage of Hashing



For all of its advantages, one of the major
disadvantages of hashing is trying to traverse
the data in sorted order.
Traversals are inefficient because a good
hashing function scatters items as randomly as
possible throughout the array.
Hence, in order to traverse the table in sorted
order, you would first have to sort the items.

File Structures

Transcript File Structures

Directory