x - DidaWiki - Università di Pisa
Download
Report
Transcript x - DidaWiki - Università di Pisa
Index construction:
Compression of postings
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading 5.3 and a paper
g-code for integer encoding
0000...........0 x in binary
Length-1
x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.
g-code for x takes 2 log2 x +1 bits
(ie. factor of 2 from optimal)
Optimal for Pr(x) = 1/2x2, and i.i.d integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8
6
3
59
7
d-code for integer encoding
g(Length)
x
Use g-coding to reduce the length of the first field
Useful for medium-sized integers
e.g., 19 represented as <00,101,10011>.
d-coding x takes about
log2 x + 2 log2( log2 x ) + 2 bits.
Optimal for Pr(x) = 1/2x(log x)2, and i.i.d integers
Variable-byte codes
[10.2 bits per TREC12]
Wish to get very fast (de)compress byte-align
Given a binary representation of an integer
Append 0s to front, to get a multiple-of-7 number of bits
Form groups of 7-bits each
Append to the last group the bit 0, and to the other
groups the bit 1 (tagging)
e.g., v=214+1 binary(v) = 100000000000001
10000001 10000000 00000001
Note: We waste 1 bit per byte, and avg 4 for the first byte.
But it is a prefix code, and encodes also the value 0 !!
PForDelta coding
Use b (e.g. 2) bits to encode 128 numbers or create exceptions
3
11
42 2
3
3
1
1
…
10 11 11 01 01 …
3
3
23 1
11 11
a block of 128 numbers = 256 bits = 32 bytes
Translate data: [base, base + 2b-1] [0,2b-1]
Encode exceptions: ESC or pointers
Choose b to encode 90% values, or trade-off:
b waste more bits, b more exceptions
2
01 10 42 23
Random access to postings lists
and other data types
(e.g. encoding skips?)
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
A basic problem !
T Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
• Array of pointers
• (log m) bits per string = (n log m) bits= 32 n bits.
• We could drop the separating NULL
Independent of string-length distribution
It is effective for few strings
It is bad for medium/large sets of strings
A basic problem !
T Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
X AbacoBattleCarColdCodDefenseGoogleYahoo....
B 10000100000100100010010000001000010000....
A 10#2#5#6#20#31#3#3#....
We could
drop msb
X 1010101011101010111111111....
B 1000101001001000100001010....
We aim at achieving ≈ n log(m/n) bits < n log m
Another textDB: Labeled Graph
Rank/Select
Wish to index the bit vector B compressed.
Select1(3) = 8
B 00101001010101011111110000011010101....
Rank1(6) = 2
• Rankb(i)
= number of b in B[1,i]
m = |B|
n = #1
• Selectb(i) = position of the i-th b in B
Do exist data structures that solve this problem in
O(1) query time and efficient space (i.e. +o(m) bits additional)
The Bit-Vector Index: m+o(m)
m = |B|
n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
Rank
B 00101001010101011 1111100010110101 0101010111000....
Z
8
4
5
8
18
z
(absolute) Rank1
(bucket-relative) Rank1
Setting Z = poly(log m) and z=(1/2) log m:
Space is |B| + (m/Z) log m + (m/z) log Z + o(m)
block pos
#1
0000 1
0
....
...
...
1011
2
1
....
m + O(m loglog m / log m) bits
Rank time is O(1)
Term o(m) is crucial in practice, B is untouched (not compressed)
The Bit-Vector Index
m = |B|
n = #1s
B 0010100101010101111111000001101010101010111001....
size r is variable k consecutive 1s
Sparse case: If r > k2 store explicitly the position of the k 1s
Dense case: k ≤ r ≤ k2, recurse... One level is enough!!
... still need a table of size o(m).
Setting k ≈ polylog m
Space is m + o(m), and B is not touched!
Select time is O(1)
There exists a Bit-Vector Index
taking o(m) extra bits
and constant time for Rank/Select.
B is read-only!
Elias-Fano index&compress
z = 3, w=2
If w = log (m/n) and z = log n, where m = |B| and n = #1
then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits
0
In unary
1
2 3 4 5
6
7
(Select1
on H)
Select1(i) on B uses L and (Select1(H,i) – i) in +o(n) space
Actually you can do binary search over B, but compressed !
If you wish to play with Rank and Select
m/10 + n log m/n
Rank in 0.4 msec, Select in < 1 msec
vs 32n bits of explicit pointers