• Non ci sono risultati.

Index construction: Compression of postings

N/A
N/A
Protected

Academic year: 2021

Condividi "Index construction: Compression of postings"

Copied!
15
0
0

Testo completo

(1)

Index construction:

Compression of postings

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(2)

Delta encoding

Sec. 3.1

1 1 2 7 20 14 …

Then you compress the resulting integers with

variable-length prefix-free codes, as follows…

(3)

Variable-bytecodes

Wish to get very fast (de)compress  byte-align

Given a binary representation of an integer

Append 0s to front, to get a multiple-of-7 number of bits

Form groups of 7-bits each

Append to the last group the bit 0, and to the other groups the bit 1 (tagging)

e.g., v=214+1  binary(v) = 100000000000001 10000001 10000000 00000001

Note: We waste 1 bit per byte, and avg 4 for the first byte.

But it is a prefix code, and encodes also the value 0 !!

T-nibble: We could design this code over t-bits, not just t=8

(4)

PForDelta coding

10 01 10 01 01 … 10 10 01 42 23

10 10

2 1 2 1 1 … 2 2 23 1

2 42 2

a block of 128 numbers = 256 bits = 32 bytes

Use b (e.g. 2) bits to encode 128 numbers or create exceptions

Encode exceptions with value 2b-1

Choose b to encode 90% values, or trade-off:

b waste more bits, b more exceptions

Translate data: [base, base + 2b-2]  [0,2b

-2

]

(5)

code

x > 0 and Binary length = log

2

x +1

e.g., 9 represented as 000 1001 .

 code for x takes 2 log

2

x +1 bits

(ie. factor of 2 from optimal)

0000...0 x in binary

Binary Length-1 Binary length

Optimal for Pr(x) = 1/2x

2

, and i.i.d integers

(6)

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original

sequence:

0001000001100110000011101100111

8 6 3 59 7

(7)

code

Use -coding to reduce the length of the first field

Useful for medium-sized integers

e.g., 19 represented as 0010110011.

coding x takes about

log

2

x + 2 log

2

( log

2

x ) + 2 bits.

(BinLen) x in binary

Optimal for Pr(x) = 1/2x(log x)2, and i.i.d integers

Binary length

(8)

z = 3, w=2

Elias-Fano

If w = log (m/n) and z = log n, where m = |B| and n = #1 then

- L takes n w = n log (m/n) bits

- H takes n 1s + n 0s = 2n bits

0 1 2 3 4 5 6 7

In unary (

Select

1

on H)

How to get the i-th number ? Take the i-th group of w bits in L and then represent the value (Select1(H,i) – i) in z bits

(9)

Rank and Select

data structures

(10)

A basic problem !

Abaco, Battle, Car, Cold, Cod ....

D

10000 100000 100 1000 100 ....

B

Array of n string pointers to strings of total length m

(n log m) bits = 32 n bits.

it depends on the number of strings

it is independent of string length

Abaco Battle Car Cold Cod ....

D

Spaces are introduced for simplicity

(11)

Rank/Select

00101001010101011111110000011010101....

B

Rank

b

(i) = number of b in B[1,i]

Select

b

(i) = position of the i-th b in B

Rank1(6) = 2 Select1(3) = 8

m = |B|

n = #1

Two approaches:

(1)Takes |B| + o(|B|) bits of space,

(2)(2) Aims at achieving n log(m/n) bits

Wish to index the bit vector B (possibly

compressed).

(12)

The Bit-Vector Index: |B| + o(|

B|)

m = |B|

n = #1s

Goal. B is read-only, and the additional index takes o(m) bits.

00101001010101011 1111100010110101 0101010111000....

B

Z

8 18

(absolute) Rank1

Setting Z = poly(log m) and z=(1/2) log m:

Extra space is + (m/Z) log m + (m/z) log Z + o(m)

+ O(m loglog m / log m) = o(m) bits

Rank time is O(1)

Term o(m) is crucial in practice, B is untouched (not compressed)

0000 1 0 .... ... ...

1011 2 1 ....

block pos #1

z

(bucket-relative) Rank1

4 5 8

Rank

There exists a Bit-Vector Index taking o(m) extra bits

and constant time for Rank/Select.

B is needed and read-only!

(13)

The Select operation

m = |B|

n = #1s

0010100101010101111111000001101010101010111001....

B

size r is variable until the subarray includes k 1s

Sparse case: If r > k2 , we store explicitly the position of the k 1s

Dense case: k ≤ r ≤ k2, recurse... One level is enough!!

... still need a table of size o(m).

Setting k ≈ polylog m

Extra space is + o(m), and B is not touched!

Select time is O(1) There exists a Bit-Vector Index taking o(m) extra bits

and constant time for Rank/Select.

B is needed and read-only!

(14)

z = 3, w=2

Via Elias-Fano (B is not needed)

Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1 then

- Space = n log (m/n) bits + 2n bits

0 1 2 3 4 5 6 7

Rank1(i) on B  Needs binary search over B

Select1(i) on B  uses L and (Select1(H,i) – i) in +o(n) space

(Build Select

1

on H so we need extra

|H| + o(|H|) bits

= 2n + o(n) bits )

(15)

If you wish to play with Rank and Select

m/10 + n log (m/n)

Rank in 0.4 sec, Select in < 1 sec vs 32n bits of explicit pointers

Riferimenti

Documenti correlati

- con cadenza bisettimanale ed entro il mercoledì della settimana precedente la frequenza degli insegnamenti prescelti, prenotare la partecipazione «in presenza» alle lezioni.

La somma dei valori assoluti degli scarti è più plausibile, cosí come la somma di qualunque funzione pari degli scarti. Si sceglie la somma dei quadrati degli scarti come funzione

[Wrong greedy for minimum vertex cover] Find an example of (family of) graphs for which the following greedy algorithm fails to give a 2-approximation for the minimum vertex

 Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.  Merge is the same as a simple append [new docIDs

Dipartimento di Informatica Università di Pisa..

Dipartimento di Informatica Università di Pisa..

Dipartimento di Informatica Università di Pisa. Reading 5.3 and

In fact it does not map symbol -&gt; bits uniquely, as Huffman, but the mapping may actually change and, moreover, it uses fractions of bits?. Fax, bzip,… are