Index construction:
Compression of postings
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Gap encoding
Sec. 3.1
1 1 2 7 20 14 …
Then you compress the resulting integers with
variable-length prefix-free codes, as follows…
Variable-bytecodes
Wish to get very fast (de)compress byte-align
Given a binary representation of an integer
Append 0s to front, to get a multiple-of-7 number of bits
Form groups of 7-bits each
Append to the last group the bit 0, and to the other groups the bit 1 (tagging)
e.g., v=214+1 binary(v) = 100000000000001 10000001 10000000 00000001
Note: We waste 1 bit per byte, and avg 4 for the first byte.
But it is a prefix code, and encodes also the value 0 !!
T-nibble: We could design this code over t-bits, not just t=8
PForDelta coding
10 01 10 01 01 … 10 10 01 42 23
10 10
2 1 2 1 1 … 2 2 23 1
2 42 2
a block of 128 numbers = 256 bits = 32 bytes
Use b (e.g. 2) bits to encode 128 numbers (32 bytes) or exceptions
Encode exceptions with value 2b-1
Choose b to encode 90% values, or trade-off:
b waste more bits, b more exceptions
Translate data: [base, base + 2b-2] [0,2b
- 2
]code
x > 0 and Binary length = log
2x +1
e.g., 9 represented as 000 1001 .
code for x takes 2 log
2x +1 bits
(ie. factor of 2 from optimal)
0000...0 x in binary
Binary Length-1 Binary length
Optimal for Pr(x) = 1/2x
2, and i.i.d integers
It is a prefix-free encoding…
Given the following sequence of coded integers, reconstruct the original
sequence:
0001000001100110000011101100111
8 6 3 59 7
z = 3, w=2
Elias-Fano
If w = log (m/n) and z = log n, where m = |B| and n = #1 then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits
0 1 2 3 4 5 6 7
In unary (
Select
1on H)
How to get the i-th number ? Take the i-th group of w bits in L and then represent the value (Select1(H,i) – i) in z bits
Rank and Select
data structures
A basic problem !
Abaco, Battle, Car, Cold, Cod ....
D
10000 100000 100 1000 100 ....
B
Array of n string pointers to strings of total length m
•
(n log m) bits = 32 n bits.
•
it depends on the number of strings
•
it is independent of string length
Abaco Battle Car Cold Cod ....
D
Spaces are introduced for simplicity
Rank/Select
00101001010101011111110000011010101....
B
•
Rank
b(i) = number of b in B[1,i]
•
Select
b(i) = position of the i-th b in B
Rank1(6) = 2 Select1(3) = 8
m = |B|
n = #1
Two approaches:
(1)Takes |B| + o(|B|) bits of space,
(2)Aims at achieving n log(m/n) bits, by deplyoing Elias-Fano + point (1)
Wish to index the bit vector B (possibly
compressed).
The Bit-Vector Index: |B| + o(|
B|)
m = |B|
n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
00101001010101011 1111100010110101 0101010111000....
B
Z
8 18(absolute) Rank1
Setting Z = poly(log m) and z=(1/2) log m:
Extra space is + (m/Z) log m + (m/z) log Z + o(m)
+ O(m loglog m / log m) = o(m) bits
Rank time is O(1)
Term o(m) is crucial in practice, B is untouched (not compressed)
0000 1 0 .... ... ...
1011 2 1 ....
block pos #1
z
(bucket-relative) Rank1
4 5 8
Rank
There exists a Bit-Vector Index taking o(m) extra bits
and constant time for Rank/Select.
B is needed and read-only!
The Select operation
m = |B|
n = #1s
0010100101010101111111000001101010101010111001....
B
size r is variable until the subarray includes k = (log m)2 1s
Sparse case: If r > k2 = (log m)4 , we store explicitly the position of the k
= (log m)2 1s, because we have at most (m/r) blocks of this type, each taking (m/r) * k * log m bits = O(m / log m) = o(m) bits
Dense case: k ≤ r ≤ k2, recurse by repeating the argument with now k’ = (log log m)2. If r’ including k’ 1s > log m bits, then store the k’ positions explicitly using O(log log m) bits each, thus O(m/log log m) = o(m) bits in total. Otherwise r’ < log m, and thus a precomputed table is enough.
Extra space is + o(m), and B is not touched!
Select time is O(1)
There exists a Bit-Vector Index taking o(m) extra bits
and constant time for Rank/Select.
B is needed and read-only!
z = 3, w=2
Via Elias-Fano (B is not needed)
Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1 then
- Space = n log (m/n) bits + 2n bits
0 1 2 3 4 5 6 7
Rank1(i) on B Needs binary search over B
Select1(i) on B uses L and (Select1(H,i) – i) in +o(n) space
(Build Select
1on H so we need extra
|H| + o(|H|) bits
= 2n + o(n) bits )
If you wish to play with Rank and Select
m/10 + n log (m/n)
Rank in 0.4 sec, Select in < 1 sec vs 32n bits of explicit pointers