Chapter 8
Indexing and Searching
Hsin-Hsi Chen
Department of Computer Science and
Information Engineering
National Taiwan University
Hsin-Hsi Chen
8-1
Introduction
• Searching
– Online text searching
• Scan the text sequentially
– Indexed searching
• Build data structures over the text to speed up the search
• Semi-static collections: updated at reasonably regular intervals
• Indexing techniques
– inverted files
– suffix (PAT) arrays
– signature files
Assumptions
• n: the size of the text database
• m: the length of the search pattern (m < n)
• M: the amount of memory available
• n’: the size of the texts that are modified (n’ < n)
• Experiments
– 32-bit Sun UltraSparc-1, 167 MHz, with 64 MB of RAM
– TREC-2 collection (WSJ, DOE, FR, ZIFF, AP)
File Structures for IR
• lexicographical indices (indices that are sorted)
– inverted files
– Patricia (PAT) trees (suffix trees and arrays)
• cluster file structures (see document clustering in Chapter 7)
• indices based on hashing
– signature files
Inverted Files
Inverted Files
• Each document is assigned a list of keywords or
attributes.
• Each keyword (attribute) is associated with
operational relevance weights.
• An inverted file is the sorted list of keywords
(attributes), with each keyword having links to the
documents containing that keyword.
• Penalty
– the size of an inverted file ranges from 10% to 100% or more of the size of the text itself
– need to update the index as the data set changes
Text (word-starting character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60):
This is a text. A text has many words. Words are made from letters.

Vocabulary | Occurrences
letters | 60 …
made | 50 …
many | 28 …
text | 11, 19 …
words | 33, 40 …

Addressing granularity: (1) inverted list – word positions or character positions; (2) inverted file – documents.

Heaps’ law: the vocabulary grows as O(n^β), β ≈ 0.4~0.6.
Vocabulary for 1GB of the TREC-2 collection: 5MB (before stemming and normalization).
Occurrences: the extra space is O(n), 30%~40% of the text size.
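The inverted list above can be sketched in a few lines of Python: record, for each word, the character positions where it starts (a minimal illustration; positions here are 0-based, while the figure counts from 1):

```python
import re

def build_inverted_index(text):
    """Map each lowercased word to the character positions where it starts."""
    index = {}
    for match in re.finditer(r"\w+", text):
        index.setdefault(match.group().lower(), []).append(match.start())
    return index

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
# 'text' occurs at 0-based positions 10 and 18 (the figure's 11 and 19)
```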
Block Addressing
• Full inverted indices
– Point to exact occurrences
• Block addressing
– the text is divided into blocks (fixed-size blocks, files, documents, Web pages, retrieval units, …)
– Point to the blocks where the word appears
– Pointers are smaller
– 5% overhead over the text size

Text (divided into Block 1 … Block 4):
This is a text. A text has many words. Words are made from letters.

Inverted index:
Vocabulary | Occurrences
letters | 4 …
made | 4 …
many | 2 …
text | 1, 2 …
words | 3 …
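A block-addressing index can be sketched the same way, mapping each occurrence to a block number instead of an exact position (a minimal illustration that uses equal-sized character blocks, so its block boundaries need not match the figure's):

```python
import re

def block_index(text, n_blocks):
    """Inverted index recording block numbers instead of exact positions."""
    block_size = -(-len(text) // n_blocks)  # ceiling division
    index = {}
    for m in re.finditer(r"\w+", text):
        block = m.start() // block_size + 1          # 1-based block number
        postings = index.setdefault(m.group().lower(), [])
        if not postings or postings[-1] != block:    # each block stored once
            postings.append(block)
    return index

text = "This is a text. A text has many words. Words are made from letters."
idx = block_index(text, 4)
# 'text' appears in blocks 1 and 2; the pointers are smaller than exact positions
```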
Sorted array implementation of an inverted file
(figure: each dictionary term points to the list of the documents in which the keyword occurs)
Index size as a fraction of the text size (first value of each pair: stop words are not indexed; second value: all words are indexed):

Indexing | Small collection (1MB) | Medium collection (200MB) | Large collection (2GB)
Addressing words | 45% 73% | 36% 64% | 35% 63%
Addressing documents | 19% 26% | 18% 32% | 26% 47%
Addressing 64K blocks | 27% 41% | 18% 32% | 5% 9%
Addressing 256 blocks | 18% 25% | 1.7% 2.4% | 0.5% 0.7%

Addressing words: full inversion (all words, exact positions, 4-byte pointers).
Addressing documents: document size about 10KB; 1, 2, or 3 bytes per pointer, depending on the text size.
Block addressing: 2 or 1 byte(s) per pointer, independent of the text size.
Searching
Three general steps:
• Vocabulary search
– Identify the words and patterns in the query
– Search for them in the vocabulary
• Retrieval of occurrences
– Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
– Solve phrases, proximity, or Boolean operations
– Find the exact word positions when block addressing is used
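For the manipulation step, a conjunctive (AND) query over document-granularity postings reduces to an intersection, starting from the shortest list (a minimal sketch, not the book's code):

```python
def and_query(index, words):
    """Intersect the postings lists of all query words, smallest list first."""
    if any(w not in index for w in words):
        return []                         # a missing word empties the answer
    postings = sorted((index[w] for w in words), key=len)
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    return sorted(result)

index = {"text": [1, 2], "many": [2], "words": [2, 3]}
# documents containing both 'text' and 'many'
```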
Structures used in Inverted Files
• Sorted arrays
– store the list of keywords in a sorted array
– use a standard binary search
– advantage: easy to implement
– disadvantage: updating the index is expensive
• B-trees
• Tries
• Hashing structures
• Combinations of these structures
Trie
Text (word-starting character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60):
This is a text. A text has many words. Words are made from letters.

Vocabulary trie (branching on ‘l’, ‘m’, ‘t’, ‘w’; under ‘m’, ‘a’ splits on ‘d’/‘n’):
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40
B-trees
(figure: a B-tree over the vocabulary; internal nodes hold separator keys such as F, M and Al, Br, E, Gr, H, Ja, L, Rut, Uni; leaves hold terms with their postings, e.g., Afghan 2 …, Russian 9 …, Ruthenian 1 …)
Sorted Arrays
1. The input text is parsed into a list of words along with their
location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to
a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
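Steps 1 and 2 above can be sketched directly: parse the text into (word, location) pairs, then sort the pairs to turn location order into alphabetical order (a minimal illustration):

```python
import re

def invert(text):
    """Parse into (word, position) pairs, then sort into an inverted file."""
    pairs = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    pairs.sort()                  # alphabetical by word, then by position
    inverted = {}
    for word, pos in pairs:
        inverted.setdefault(word, []).append(pos)
    return inverted

inv = invert("a rose is a rose")
# 'rose' occurs at character positions 2 and 12
```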
Inversion of Word List
(figure: a word list sorted by location is inverted into alphabetical order; “report” appears in two records)
Dictionary and postings file
Idea: the file to be searched should be as short as possible, so split the single file into two pieces: a dictionary (vocabulary) and a postings file (occurrences, stored as document #, frequency pairs).
e.g. data set: 38,304 records, 250,000 unique terms, 88 postings/record
Producing an Inverted File for Large Data Sets without Sorting
Idea: avoid the use of an explicit sort by using a right-threaded binary tree. Each tree node stores the current number of term postings and the storage location of the postings list. To produce the inverted file, traverse the binary tree and the linked postings lists.
Indexing Statistics
Final index: only 8% of the input text size for the 50MB database; 14% of the input text size for the larger (2GB) database.
Working storage (the storage needed to build the index): not much larger than the size of the final index for the new indexing method.
A Fast Inversion Algorithm
• Principle 1: large primary memories are available
If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
• Principle 2: the inherent order of the input data
It is very expensive to use polynomial or even n log n sorting algorithms for large files.
FAST-INV algorithm
(figure: document vectors are split into loads; a table of concept postings/pointers drives the inversion — see p. 22)
Sample document vector
(figure: a list of document number, concept number pairs, one concept number for each unique word — similar to the document-word list shown on p. 16)
The concept numbers are sorted within document numbers, and the document numbers are sorted within the collection.
HCN = highest concept number in the dictionary (the total number of concepts in the dictionary)
L = number of concepts/document (documents/concept) pairs in the collection
M = available primary memory size, in bytes (M >> HCN, M < L)
The collection is split into j parts so that L/j < M, i.e., each part will fit into primary memory; approximately HCN/j concepts are associated with each part.
LL = length of the current load, i.e., its number of concept-weight pairs (8 bytes per concept-weight pair)
S = spread of concept numbers in the current load (4 bytes per count of postings)
The memory constraint is 8*LL + 4*S < M.
Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#].

Cumulative counts for the (doc#, con#) pairs:
…………………… 0
(1,2), (1,4) ……….. 2
(2,3) …………….. 3
(3,1), (3,2), (3,5) ... 6
(4,2), (4,3) ………. 8
...
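The counts collected in con_entries_cnt then drive the construction of the load table. A sketch of that partitioning under the slide's constraint 8*LL + 4*S < M (the function and parameter names here are mine, not FAST-INV's originals):

```python
def make_load_table(con_counts, mem_bytes, pair_bytes=8, count_bytes=4):
    """Split the concept range into loads so that each load fits in memory:
    pair_bytes*LL + count_bytes*S < mem_bytes, where LL is the number of
    postings in the load and S its spread of concept numbers."""
    loads, start, ll = [], 0, 0
    for con, cnt in enumerate(con_counts):
        spread = con - start + 1
        if con > start and pair_bytes * (ll + cnt) + count_bytes * spread >= mem_bytes:
            loads.append((start, con - 1))   # close the current load
            start, ll = con, 0
        ll += cnt
    loads.append((start, len(con_counts) - 1))
    return loads

# con_counts[c] = number of postings for concept c
loads = make_load_table([2, 1, 3, 2, 1], mem_bytes=40)
```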
Preparation (continued)
5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for the documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update the information for the current load table entry.
(figure: the load table records the range of concepts for each primary load)
How is the load table generated?
LL: length of the current load
S: end concept − start concept + 1
(space per concept/weight pair)*LL + (space per concept for the count of postings)*S < M
Read in each (Doc, Con) pair and look Con up in the load table to determine which load the pair belongs to.
(Note: SS exists to manage the appending of data.)
Invert each load file in turn. The offsets in the CONPTR table indicate the position where each entry should be placed: copy rather than sort.
PAT Trees and PAT Arrays
(Suffix Trees and Suffix Arrays)
PAT Trees and PAT Arrays
• Problems of traditional IR models
– Documents and words are assumed.
– Keywords must be extracted from the text (indexing).
– Queries are restricted to keywords.
• New indices for text
– A text is regarded as a long string.
– Each position corresponds to a semi-infinite string (sistring).
– suffix: a string that goes from a text position to the end of the text
– Each suffix is uniquely identified by its position.
– no structures and no keywords
Text:
This is a text. A text has many words. Words are made from letters.
Suffixes (all different):
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
Index points are selected from the text; they point to the beginning of the text positions which are retrievable.
PATRICIA
• trie
– branch decision node: search decision-markers
– element node: real data
– if branch decisions are made on each bit, a
complete binary tree is formed where the depth
is equal to the number of bits of the longest
strings
– many element nodes and branch nodes are null
PATRICIA (Continued)
• compressed digital search trie
– the null element nodes and branch nodes are
removed
– an additional field to denote the comparing bit
for branching decision is included in each
decision node
– a matching between the searched results and their search keys is required, because only some of the bits are compared during the search process
PATRICIA (Continued)
• Practical Algorithm to Retrieve Information
Coded in Alphanumeric
– augmented branch node: an additional field for
storing elements is included in branch node
– each element is stored in an upper node or in
itself
– an additional root node: note that the number of leaf nodes is always greater than that of internal nodes by one
PAT-tree
• PATRICIA + semi-infinite strings
– a text T with n basic units u1 u2 … un
– sistrings: u1 u2 … un …, u2 u3 … un …, u3 u4 … un …, …
– an end to the left but none to the right
– store the starting positions of the semi-infinite strings in a text using PATRICIA
semi-infinite strings
• Example
Text: Once upon a time, in a far away land …
sistring 1: Once upon a time …
sistring 2: nce upon a time …
sistring 8: on a time, in a …
sistring 11: a time, in a far …
sistring 22: a far away land …
• Compare sistrings: 22 < 11 < 2 < 8 < 1
PAT Tree
• PAT Tree
A Patricia tree constructed over all the possible sistrings
of a text
• Patricia tree
– a digital tree where the individual bits of the keys are
used to decide on the branching
– each internal node indicates which bit of the query is
used for branching
• absolute bit position
• a count of the number of bits to skip
– each external node is a sistring, i.e., the integer
displacement
Example
Text: 01100100010111 …
sistring 1: 01100100010111 …
sistring 2: 1100100010111 …
sistring 3: 100100010111 …
sistring 4: 00100010111 …
sistring 5: 0100010111 …
sistring 6: 100010111 …
sistring 7: 00010111 …
sistring 8: 0010111 ...
(figure: the PAT tree over these sistrings; each internal node is labeled with the total displacement of the bit to be inspected, and each external node holds a sistring's integer displacement)
skip counter & pointer
(figure: the PAT tree for the same text 01100100010111 …, with skip counters and pointers stored in the internal nodes; searching for 00101 follows the skip counters down to an external node. Note: four bits are needed to distinguish sistrings 3 and 6.)
Text (word-starting character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60):
This is a text. A text has many words. Words are made from letters.
(figure: the suffix trie over the word-starting suffixes, branching on ‘l’, ‘m’ (then ‘a’: ‘d’/‘n’), ‘t’ (‘e’ ‘x’ ‘t’), ‘w’ (‘o’ ‘r’ ‘d’ ‘s’), with leaves 60, 50, 28, 11, 19, 33, 40; and the suffix tree obtained by compressing unary paths)
space overhead: 120%~240% over the text size
PAT Trees Represented as Arrays
• indirect binary search vs. sequential search
Keep the external nodes in the bucket in the same relative order as they would be in the tree.
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
(figure: the PAT tree whose external nodes, read left to right, give the PAT array)
Text (word-starting character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60):
This is a text. A text has many words. Words are made from letters.
(1) Suffix tree over the word-starting suffixes: 120%~240% overhead
(2) Suffix array: 60 50 28 19 11 40 33 — 40% overhead
(3) Supra-index: sampled entries (e.g., lett, text, word) on top of the suffix array 60 50 28 19 11 40 33
difference between suffix array and inverted list
• suffix array: the occurrences of each word are sorted lexicographically by the text following the word
• inverted list: the occurrences of each word are sorted by text position

Text (word-starting character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60):
This is a text. A text has many words. Words are made from letters.
Vocabulary (with supra-index): letters, made, many, text, words
Suffix array: 60 50 28 19 11 40 33
Inverted list: 60 50 28 11 19 33 40
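The two orderings can be reproduced in a few lines: collect the occurrences of a word, then keep one copy in position order and sort the other by the text that follows each position (a minimal illustration; positions are 0-based, so 18/10 correspond to the figure's 19/11):

```python
import re

text = "This is a text. A text has many words. Words are made from letters."
low = text.lower()
positions = [m.start() for m in re.finditer(r"\w+", low)]

# Inverted-list order: occurrences of "text" sorted by text position.
inverted = [p for p in positions if low.startswith("text", p)]

# Suffix-array order: the same occurrences sorted by the following text.
suffix_order = sorted(inverted, key=lambda p: low[p:])
# "text has…" sorts before "text. a…", so the suffix array lists 18 before 10
```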
Indexing Points
• The above example assumes every position in the
text is indexed.
n external nodes, one for each position in the text
• word and phrase searches
sistrings that are at the beginning of words are necessary
• trade-off between size of the index and search
requirements
Prefix searching
• idea
Every subtree of the PAT tree has all the sistrings with a given prefix.
• Search: proportional to the query length
Exhaust the prefix or reach an external node.
(figure: searching for the prefix “10100” and its answer subtree)
Searching PAT Trees as Arrays
• Prefix searching and range searching
Do an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than.
• example
Search for the prefix 100 and its answer.
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
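The indirect binary search can be sketched with two bisections over the suffix array, comparing only the first len(prefix) characters of the suffix each probe points to (a minimal illustration over the slide's example text):

```python
def prefix_range(text, sa, prefix):
    """Return (lo, hi) so that sa[lo:hi] are exactly the entries of the
    sorted suffix array `sa` whose suffixes start with `prefix`."""
    def bound(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            head = text[sa[mid]:sa[mid] + len(prefix)]   # compare prefix only
            if head < prefix or (upper and head == prefix):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(False), bound(True)

text = "01100100010111"
sa = sorted(range(len(text)), key=lambda p: text[p:])
lo, hi = prefix_range(text, sa, "100")
# sa[lo:hi] holds the text positions whose suffixes start with "100"
```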
Proximity Searching
• Find all places where s1 is at most a fixed (given by a user) number of characters away from s2.
in 4 ation ==> insulation, international, information
• Algorithm
1. Search for s1 and s2.
2. Select the smaller answer set of the two and sort it by position.
3. Traverse the unsorted answer set, searching every position in the sorted set and checking if the distance between positions satisfies the proximity condition.
sort+traverse time: (m1+m2) log m1 (assume m1 < m2)
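The three steps above can be sketched as follows: sort the smaller answer set, then probe it with a binary search for every element of the other set (a minimal illustration):

```python
import bisect

def proximity(pos1, pos2, k):
    """Report pairs of positions from the two answer sets at distance <= k."""
    small, large = sorted((pos1, pos2), key=len)
    small = sorted(small)                      # sort the smaller answer set
    hits = []
    for p in large:                            # traverse the unsorted set
        i = bisect.bisect_left(small, p - k)   # first candidate within range
        while i < len(small) and small[i] <= p + k:
            hits.append((p, small[i]))
            i += 1
    return hits

# occurrences of s1 at 10 and 50, of s2 at 12, 100, 300; within 5 characters
```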
Range Searching
• Search for all the strings within a certain
lexicographical range.
• the range “abc” .. ”acc”:
“abracadabra”, “acacia” ○
“abacus”, “acrimonious” X
• Algorithm
– Search for each end of the defining interval.
– Collect all the subtrees between (and including) them.
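Over a sorted vocabulary, the collection step is just two binary searches, one per end of the interval (a minimal illustration using the slide's example):

```python
import bisect

def range_search(words, low, high):
    """All strings lexicographically between low and high (inclusive)."""
    vocab = sorted(words)
    lo = bisect.bisect_left(vocab, low)     # first entry >= low
    hi = bisect.bisect_right(vocab, high)   # first entry > high
    return vocab[lo:hi]

hits = range_search(["abracadabra", "acacia", "abacus", "acrimonious"],
                    "abc", "acc")
# "abracadabra" and "acacia" fall in the range; the other two do not
```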
Searching Suffix Array
• P1 ≤ S < P2
– Binary search both limiting patterns in the suffix array.
– Find all the elements lying between both positions.
Longest Repetition Searching
• the match between two different positions of a text where this match is the longest in the entire text
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1
(figure: the PAT tree over sistrings 1–8; the tallest internal node gives a pair of sistrings that match for the greatest number of characters)
“Most Significant” or “Most Frequent” Matching
• the most frequently occurring strings within the text database, e.g., the most frequent trigram
• to find the most frequent trigram, find the largest subtree at a distance of 3 characters from the root
(figure: in the example PAT tree, the first 3 characters are the same for sistrings 100100010111 and 100010111)
Building PAT Trees as Patricia Trees
• bucketing of external nodes
– collect more than one external node
– a bucket replaces any subtree with size less than a
certain constraint (b)
save significant number of internal nodes
– the external nodes inside a bucket do not have any
structure associated with them
increase the number of comparisons for each search
Building PAT Trees as Patricia Trees
(Continued)
• mapping the tree onto the disk using super-nodes
– Allocate as much as possible of the tree in a disk page
– Every disk page has a single entry point,
contains as much of the trees as possible,
and terminates either in external nodes or in pointers to
other disk pages
– The pointers in internal nodes address either a disk page
or another node inside the same page
– disk pages contain on the order of 1,000
internal/external nodes
– on the average, each disk page contains about 10 steps
of a root-to-leaf path
Suffix array construction (in MM)
• The suffix array and the text must be in
main memory
– The suffix array is the set of pointers
lexicographically sorted
– The pointers are collected in ascending text
order
– The pointers are sorted by the text they point to
(accessing the text at random positions)
Suffix array construction (in MM)
• Algorithm
– All the suffixes are bucket-sorted according to the first letter only.
– At iteration i, the suffixes start out already sorted by their first 2^(i-1) letters and end up sorted by their first 2^i letters.
• To sort the text positions Ta… and Tb… in the suffix array at this stage,
• determine the relative order between the text positions T(a+2^(i-1))… and T(b+2^(i-1))…, which are already sorted by their first 2^(i-1) letters.
Construction of Suffix Arrays
for Large Text
• Split the text into blocks that can be sorted in main memory.
– Build the suffix array for the first block.
– Build the suffix array for the second block.
– Merge both suffix arrays.
– Build the suffix array for the third block.
– Merge the new suffix array with the previous one.
– Build the suffix array for the fourth block.
– Merge the new suffix array with the previous one.
– …
Merge Step
• How to merge a large suffix array with the small
suffix array?
– Determine how many elements of the large array are to
be placed between each pair of elements in the small
array
• Read the large array sequentially into main memory
• Each suffix of that text is searched in the small suffix array
• Increment appropriate counter
– Use the information to merge the arrays without
accessing the text
(figure: (a) the local suffix array for the small text is built; (b) the long text is read sequentially and the counters are computed — each suffix of the long text is searched in the small suffix array and the appropriate counter is incremented; (c) the long suffix array and the small suffix array are merged, using the counters, into the final suffix array)
Signature Files
Signature Files
• basic idea: inexact filter
– discard many of nonqualifying items
– qualifying items definitely pass the test
– “false hits” or “false drops” may also pass accidentally
• procedure
– Documents are stored sequentially in “text file”.
– Their signatures (hash-coded bit patterns) are stored in
the “signature file”.
– When a query arrives, scan the signature file, discard nonqualifying documents, and check the rest.
Merits of Signature Files
• faster than full text scanning
– 1 or 2 orders of magnitude faster
• modest space overhead
– 10-15% vs. 50-300% (inversion)
• insertions can be handled more easily than
inversion
– append only
– no reorganization or rewriting
Basic Concepts
• Use superimposed coding to create signatures.
• Each document is divided into logical blocks.
• A block contains D distinct non-common words. (D is called B in the textbook.)
• Each word yields a “word signature”.
• A word signature is an F-bit pattern with m 1-bits. (F is called l in the textbook.)
– Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee
– Each such triplet is hashed to a bit position.
• The word signatures are OR’ed to form the block signature.
• Block signatures are concatenated to form the document signature.
Basic Concepts (Continued)
• Example (D=2, F=12, m=4)
word | signature
free | 001 000 110 010
text | 000 010 101 001
block signature | 001 010 111 011
• Search
– Use the hash functions to determine the m 1-bit positions of the search word.
– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1.
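Superimposed coding can be sketched as follows; the triplet hash used here (CRC32 modulo F) is an arbitrary stand-in for the textbook's hash function, so the bit patterns will not match the example above:

```python
import zlib

F = 12  # signature size in bits

def word_signature(word):
    """OR one bit per overlapping triplet of the word into an F-bit pattern."""
    sig = 0
    for i in range(max(1, len(word) - 2)):
        triplet = word[i:i + 3]
        sig |= 1 << (zlib.crc32(triplet.encode()) % F)
    return sig

def block_signature(words):
    """Word signatures are OR'ed to form the block signature."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, word):
    """Inexact filter: False is definite, True may be a false drop."""
    w = word_signature(word)
    return block_sig & w == w
```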
A Signature File
Text (divided into Block 1 … Block 4):
This is a text. A text has many words. Words are made from letters.
Text signature (one block signature per block): 000101, 110101, 100100, 101101
Word signatures:
h(text) = 000101
h(many) = 110000
h(words) = 100100
h(made) = 001100
h(letters) = 100001
Basic Concepts (Continued)
• false alarm (false hit, or false drop) Fd
the probability that a block signature seems to qualify, given that the block does not actually qualify:
Fd = Prob{signature qualifies | block does not}
• Ensure that the probability of a false alarm is low enough while keeping the signature file (an N*F binary matrix) as short as possible.
• For a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix contains “1”s with probability 0.5:
F ln2 = m*D, i.e., m = ln2 * F/D, and then Fd = 2^(-m)
F: signature size in bits
m: number of bits per word
D: number of distinct non-common words per document
Fd: false drop probability
On average, a word consists of 10 characters and a character has 8 bits, so the space overhead of the index is (1/80)*(F/D), with F measured in bits and D in words.
10% overhead: false drop probability close to 2%
10% = (1/80)*(F/D) → F/D = 8
m = 8*ln2 = 5.545
Fd = 2^(-5.545) ≈ 2%
20% overhead: false drop probability close to 0.046%
20% = (1/80)*(F/D) → F/D = 16
m = 16*ln2 = 11.09
Fd = 2^(-11.09) ≈ 0.046%
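The arithmetic above is easy to check numerically (a small helper; the 80 bits/word figure is the slide's 10-character, 8-bits-per-character assumption):

```python
import math

def overhead(F, D, bits_per_word=80):
    """Space overhead: F/D signature bits per 80-bit word of text."""
    return (F / D) / bits_per_word

def false_drop(F, D):
    """With the optimal m = ln2 * F/D bits per word, Fd = 2**(-m)."""
    m = math.log(2) * F / D
    return 2 ** (-m)

# F/D = 8  -> 10% overhead, Fd about 2%
# F/D = 16 -> 20% overhead, Fd about 0.046%
```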
Sequential Signature File (SSF)
(figure: documents and their signatures stored sequentially; the size of a document signature = the size of a block signature = F, assuming each document spans exactly one logical block)
Classification of Signature-Based Methods
• Compression
If the signature matrix is deliberately sparse, it can be compressed.
• Vertical partitioning
Storing the signature matrix column-wise improves the response time at the expense of insertion time.
• Horizontal partitioning
Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
Classification of Signature-Based Methods
• Sequential storage of the signature matrix
– without compression
sequential signature files (SSF)
– with compression
bit-block compression (BC)
variable bit-block compression (VBC)
• Vertical partitioning
– without compression
bit-sliced signature files (BSSF, B’SSF)
frame-sliced (FSSF)
generalized frame-sliced (GFSSF)
Classification of Signature-Based Methods
(Continued)
– with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)
• Horizontal partitioning
– data-independent partitioning
Gustafson’s method
partitioned signature files
– data-dependent partitioning
2-level signature files
S-trees
Criteria
• the storage overhead
• the response time on single word queries
• the performance on insertion, as well as
whether the insertion maintains the
“append-only” property
Compression
• idea
– Create sparse document signatures on purpose.
– Compress them before storing them sequentially.
• Method
– Use B-bit vector, where B is large.
– Hash each word into one (n) bit position(s).
– Use run-length encoding.
Compression using run-length encoding
word | signature
data | 0000 0000 0000 0010 0000
base | 0000 0001 0000 0000 0000
management | 0000 1000 0000 0000 0000
system | 0000 0000 0000 0000 1000
block signature | 0000 1001 0000 0010 1000
The lengths L1 … L5 of the 0-runs are stored encoded: [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.
search: decode the encoded lengths of all the preceding intervals
example: search for “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000
disadvantage: search becomes slow
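The run-length scheme and its slow bit test can be sketched like this (a minimal illustration storing the raw run lengths rather than the encoded [L] values):

```python
def runs_of_zeros(bits):
    """Lengths of the 0-runs preceding each 1 in a sparse bit vector."""
    runs, run = [], 0
    for bit in bits:
        if bit:
            runs.append(run)
            run = 0
        else:
            run += 1
    return runs

def bit_is_set(runs, pos):
    """Testing one bit requires decoding all the preceding run lengths,
    which is why search on the compressed form is slow."""
    p = -1
    for run in runs:
        p += run + 1              # position of the next 1
        if p == pos:
            return True
        if p > pos:
            return False
    return False

sig = [int(c) for c in "00001001000000101000"]   # the block signature above
```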
Bit-block Compression (BC)
Data structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm (encoding 0000 1001 0000 0010 1000, b=4):
Part I. One bit per bit-block; it indicates whether there are any “1”s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
0 1 0 1 1
Part II. The number s of “1”s in the bit-block, encoded as s-1 “1”s and a terminating zero.
10 0 0
Part III. The offsets of the “1”s from the beginning of the bit-block (b=4, so offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11).
00 11 10 00
block signature: 01011 | 10 0 0 | 00 11 10 00
Bit-block Compression (BC)
(Continued)
Search for “data”:
(1) data ==> 0000 0000 0000 0010 0000
(2) the “1” falls in the 4th bit-block
(3) signature 01011 | 10 0 0 | 00 11 10 00
(4) Part I shows there is at least one “1” in the 4th bit-block.
(5) Check further: the “0” in Part II tells us there is only one “1” in the 4th bit-block. Is it the 3rd bit?
(6) Yes, the offset “10” confirms the result.
Discussion:
(1) Bit-block compression requires less space than the sequential signature file for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the sequential signature file.
Vertical Partitioning
• idea
avoid bringing useless portions of the document
signature in main memory
• methods
– store the signature file in a bit-sliced form or in a frame-sliced form
– store the signature matrix column-wise to improve the response time at the expense of insertion time
Bit-Sliced Signature Files (BSSF)
(figure: the N*F matrix of document signatures is transposed, so that each of the F bit positions becomes one bit-file over the N documents)
search: (1) Retrieve m bit vectors (instead of F bit vectors).
e.g., the word signature of free is 001 000 110 010; a document containing “free” has the 3rd, 7th, 8th, and 11th bits set, so only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) “AND” these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
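Bit-slicing can be sketched with integers as bit vectors, one per slice; searching ANDs only the m slices selected by the word signature (a minimal in-memory illustration of the on-disk layout):

```python
def make_bit_slices(doc_sigs, F):
    """Transpose N document signatures (F-bit ints) into F slices,
    each an int holding one bit per document."""
    slices = [0] * F
    for doc, sig in enumerate(doc_sigs):
        for b in range(F):
            if sig >> b & 1:
                slices[b] |= 1 << doc
    return slices

def search(slices, word_sig, n_docs):
    """AND only the slices where the word signature has a 1."""
    result = (1 << n_docs) - 1          # start with all documents qualifying
    for b, s in enumerate(slices):
        if word_sig >> b & 1:
            result &= s
    return [d for d in range(n_docs) if result >> d & 1]

doc_sigs = [0b0011, 0b0110, 0b1010]     # three document signatures, F=4
slices = make_bit_slices(doc_sigs, 4)
```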
Frame-Sliced Signature File (FSSF)
• Ideas
– random disk accesses are more expensive than
sequential ones.
– force each word to hash into bit positions that are closer
to each other in the document signature
– these bit files are stored together and can be retrieved
with a few random accesses
• Procedures
– The document signature (F bits long) is divided into k
frames of s consecutive bits each.
– For each word in the document, one of the k frames will
be chosen by a hash function.
– Using another hash function, the word sets m bits in
that frame.
(figure: the signature matrix divided column-wise into frames; each frame is kept in consecutive disk blocks)
FSSF (Continued)
• Example (D=2, F=12, s=6, k=2, m=3)
word | signature
free | 000000 110010
text | 010110 000000
doc. signature | 010110 110010
• Search
– Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
e.g., to search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.
– At most n frames have to be scanned for an n-word query.
• Insertion
Only k frames have to be accessed, instead of F bit-slices.
Vertical Partitioning and Compression
• idea
– create a very sparse signature matrix
– store it in a bit-sliced form
– compress each bit slice by storing the position
of the 1s in the slice.
Compressed Bit Slices (CBS)
• Rooms for improvements for bit-sliced method
– Searching
• Each search word requires the retrieval of m bit-files.
• The search time could be improved if m were forced to be 1.
– Insertion
• Requires too many disk accesses (equal to F, which is typically 600-1000).
Compressed Bit Slices (CBS)
(Continued)
• Let m=1. To maintain the same false drop probability, F (and so the size S of a signature) has to be increased.
(figure: one bit-setting for each word; only one row has to be read)
Representation for a word (over the document collection): take one row of the matrix, squeeze out the 0s, and keep only the 1s of that row, i.e., chain the 1-positions together.
Synonyms (words hashed to the same position) are not distinguished.
Hash a word to obtain a bucket address, e.g., h(“base”)=30, and obtain the pointers to the relevant documents from the buckets.
Doubly Compressed Bit Slices
Idea: compress the sparse directory.
As S becomes smaller, the chance of collisions increases, so intermediate buckets are used; to distinguish true collisions from false ones, an extra hash function is added.
Synonyms are partially distinguished.
h1(“base”)=30
h2(“base”)=011
Follow the pointers of the posting buckets to retrieve the qualifying documents.
No False Drops Method
Idea: use a pointer to the word in the text file (fixed length, to save space).
Synonyms are distinguished completely. (In the previous figure, h2 could still produce collisions.)
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion.
Partitioned Signature Files
• Using a portion of a document signature as a
signature key to partition the signature file.
• All signatures with the same key will be grouped
into a so-called “module”.
• When a query signature arrives,
– examine its signature key and look for the
corresponding modules
– scan all the signatures within those modules that have
been selected
Comparisons
• signature files
– use hashing techniques to produce an index
– advantage
• storage overhead is small (10%-20%)
– disadvantages
• the search time on the index is linear
• some answers may not match the query, thus filtering must be done
Comparisons (Continued)
• inverted files
– storage overhead (30% ~ 100%)
– search time for word searches is logarithmic
• PAT arrays
– potential use in other kinds of searches
• phrases
• regular expression searching
• approximate string searching
• longest repetitions
• most frequent searching