• Non ci sono risultati.

Index construction: Compression of postings

N/A
N/A
Protected

Academic year: 2021

Condividi "Index construction: Compression of postings"

Copied!
7
0
0

Testo completo

(1)

Index construction:

Compression of postings

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 5.3 and a paper

(2)

Delta encoding

Sec. 3.1

1 1 2 7 20 14 …

Then you compress the resulting integers with

variable-length prefix-free codes, as follows…

(3)

code

x > 0 and Binary length = log

2

x +1

e.g., 9 represented as 000 1001 .

 code for x takes 2 log

2

x +1 bits

(ie. factor of 2 from optimal)

0000...0 x in binary

Binary Length-1 Binary length

Optimal for Pr(x) = 1/2x

2

, and i.i.d integers

(4)

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original

sequence:

0001000001100110000011101100111

8 6 3 59 7

(5)

code

Use -coding to reduce the length of the first field

Useful for medium-sized integers

e.g., 19 represented as 0010110011.

coding x takes about

log

2

x + 2 log

2

( log

2

x ) + 2 bits.

(BinLen) x in binary

Optimal for Pr(x) = 1/2x(log x)2, and i.i.d integers

Binary length

(6)

Variable-bytecodes

Wish to get very fast (de)compress  byte-align

Given a binary representation of an integer

Append 0s to front, to get a multiple-of-7 number of bits

Form groups of 7-bits each

Append to the last group the bit 0, and to the other groups the bit 1 (tagging)

e.g., v=2

14

+1  binary(v) = 100000000000001

10000001 10000000 00000001

Note: We waste 1 bit per byte, and avg 4 for the first byte.

But it is a prefix code, and encodes also the value 0 !!

T-nibble: We could design this code over t-bits, not just t=8

(7)

PForDelta coding

10 01 10 01 01 … 10 10 01 42 23

10 10

2 1 2 1 1 … 2 2 23 1

2 42 2

a block of 128 numbers = 256 bits = 32 bytes

Use b (e.g. 2) bits to encode 128 numbers or create exceptions

Encode exceptions with value 2b-1

Choose b to encode 90% values, or trade-off:

b waste more bits, b more exceptions

Translate data: [base, base + 2b-2]  [0,2b

-2

]

Riferimenti

Documenti correlati

Per i muri intonacati la tipologia è stata ipotizzata sulla base degli spessori, del periodo di costruzione e della vicinanza a muri noti.. La tipologia della muratura è riportata

La somma dei valori assoluti degli scarti è più plausibile, cosí come la somma di qualunque funzione pari degli scarti. Si sceglie la somma dei quadrati degli scarti come funzione

[Wrong greedy for minimum vertex cover] Find an example of (family of) graphs for which the following greedy algorithm fails to give a 2-approximation for the minimum vertex

 Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.  Merge is the same as a simple append [new docIDs

Dipartimento di Informatica Università di Pisa..

Dipartimento di Informatica Università di Pisa..

Dipartimento di Informatica Università di Pisa..

In fact it does not map symbol -> bits uniquely, as Huffman, but the mapping may actually change and, moreover, it uses fractions of bits?. Fax, bzip,… are