Index construction:
Compression of documents
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding
Huffman codes
Arithmetic codes
Dictionary coding
LZ77, LZ78, LZSS,…
Gzip, zippy, snappy,…
Text transforms
Burrows-Wheeler Transform
bzip
Basics of Data Compression
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol.
e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It can be parsed both as 101·1 = ca and as 1·011 = ad.
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie:
[Figure: binary trie with 0/1-labeled edges and leaves a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$
Example: p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords starting with 1)
La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)
We say that a prefix code C is optimal if, for all prefix codes C', $L_a(C) \le L_a(C')$.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is:
$i(s) = \log_2 \frac{1}{p(s)}$ bits
Lower probability ⇒ higher information. Entropy is the weighted average of i(s):
$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$
0-th order empirical entropy of string T:
$H_0(T) = \sum_{s} \frac{occ(s,T)}{|T|} \log_2 \frac{|T|}{occ(s,T)}$
Note that $0 \le H \le \log_2 |\Sigma|$:
H → 0 for a skewed distribution; H is maximum for a uniform distribution.
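A minimal sketch of the 0-th order empirical entropy formula above, assuming Python; the function name and Counter-based tally are illustrative choices:

import math
from collections import Counter

def empirical_entropy_h0(text):
    # H0(T) = sum over symbols s of (occ(s)/|T|) * log2(|T|/occ(s))
    n = len(text)
    return sum((occ / n) * math.log2(n / occ) for occ in Counter(text).values())

print(empirical_entropy_h0("aaaaaaaaab"))  # skewed: ~0.47 bits per symbol
print(empirical_entropy_h0("abcd"))        # uniform: 2.0 = log2 |alphabet|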
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to compression ratio.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1
H ≈ 1.36 bits; Huffman ≈ 1.5 bits per symbol
Shannon (entropy vs avg codeword length): $H(S)$ vs $\sum_{s} p(s) \cdot |c(s)|$
In practice (empirical entropy vs compression ratio): $|T| \cdot H_0(T)$ vs $|C(T)|$
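As a worked check of the two figures above:

$H = 0.7 \log_2 \tfrac{1}{0.7} + 3 \cdot 0.1 \log_2 \tfrac{1}{0.1} \approx 0.7 \cdot 0.515 + 0.3 \cdot 3.322 \approx 1.36$ bits
$L_a = 0.7 \cdot 1 + 0.1 \cdot 2 + 2 \cdot (0.1 \cdot 3) = 1.5$ bits, e.g. with codewords A = 0, B = 10, C = 110, D = 111.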
An optimal code is surely one that…
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the early '50s.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression, …
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, La(Huff) < H + 1: less than 1 extra bit per symbol on average!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
a = 000, b = 001, c = 01, d = 1
[Figure: Huffman tree with leaves a(.1), b(.2), c(.2), d(.5), internal nodes (.3), (.5), (1), and 0/1-labeled edges]
There are 2^(n-1) "equivalent" Huffman trees, obtained by swapping the 0/1 labels of the internal nodes.
What about ties (and thus, tree depth)?
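A minimal sketch of the greedy construction, assuming Python; heapq and the tie-breaking counter are implementation choices, and because of ties the exact bits may differ from the slide (only the codeword lengths are guaranteed):

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword
    uid = count()  # breaks ties between equal-weight subtrees
    heap = [(p, next(uid), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # merge the two least-probable subtrees
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(uid), (t1, t2)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: label branches 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths match the example: a, b -> 3 bits, c -> 2, d -> 1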
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
[Figure: the Huffman tree above, with leaves a(.1), b(.2), c(.2), d(.5) and 0/1-labeled edges]
Example: abc… encodes to 00000101…; the bit stream 101001… decodes to dcb…
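A sketch of both directions with the codeword table above; the dict-based trie walk is an illustrative choice:

def encode(text, codes):
    return "".join(codes[s] for s in text)

def decode(bits, codes):
    inv = {cw: s for s, cw in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b                 # extend the root-to-node path by one branch
        if cur in inv:           # reached a leaf: emit symbol, return to root
            out.append(inv[cur])
            cur = ""
    return "".join(out)

codes = {"a": "000", "b": "001", "c": "01", "d": "1"}
print(encode("abc", codes))      # 00000101
print(decode("101001", codes))   # dcb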
Huffman in practice
The compressed file of n symbols consists of:
Preamble: tree encoding + symbols in leaves
Body: compressed text of n symbols
Preamble = Θ(|Σ| log |Σ|) bits
Body is at least nH bits and at most nH + n bits
The extra +n is bad for very skewed distributions, namely ones for which H → 0.
Example: p(a) = 1/n, p(b) = (n-1)/n
There are better choices:
T = aaaaaaaaab
Huffman = {a,b}-encoding + 10 bits
RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding
= 0001001 1 + {a,b}-encoding
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map symbol → bits uniquely as Huffman does, the mapping may change along the text, and it may spend fractions of a bit per symbol.
Fax, bzip, … use RLE.
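The run-length bit counts in the example match Elias γ codes (⌊log2 v⌋ zeros, then v in binary); the slide does not name the code, so treating the run lengths as γ-coded is an assumption:

def elias_gamma(v):
    # assumption: run lengths v >= 1 coded as floor(log2 v) zeros, then v in binary
    b = bin(v)[2:]
    return "0" * (len(b) - 1) + b

print(elias_gamma(9), elias_gamma(1))  # 0001001 1 -> 8 bits, vs Huffman's 10 bits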
Idea on Huffman?
Goal: reduce the impact of the +1 bit
Solution:
Divide the text into blocks of k symbols
The +1 is spread over k symbols
So the loss is 1/k per symbol
Caution: the alphabet becomes Σ^k, so the tree gets larger, and so does the preamble. At the limit, the preamble stores one k-gram = the whole input text, and the compressed body is 1 bit only.
This means no compression at all!
Document Compression
Arithmetic coding
Introduction
It uses "fractional" parts of bits!!
Gets < nH(T) + 2 bits vs. < nH(T) + n (Huffman)
Used in JPEG/MPEG (as option), bzip
More time-costly than Huffman, but an integer implementation is not too bad.
Nearly ideal performance: in practice, the overhead is about 0.02 · n bits.
Symbol interval
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7
The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence: bac (p(a) = .2, p(b) = .5, p(c) = .3)
Start with [0, 1), split as a: [0, .2), b: [.2, .7), c: [.7, 1).
b ⇒ [0.2, 0.7); split it by sizes (0.7 - 0.2) · p(s) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
a ⇒ [0.2, 0.3); split it by sizes (0.3 - 0.2) · p(s) into a: [.2, .22), b: [.22, .27), c: [.27, .3)
c ⇒ [0.27, 0.3)
The final sequence interval is [0.27, 0.3).
The algorithm
To code a sequence of symbols T_1 … T_n with probabilities p(T_i), use the following recurrences, starting from $l_0 = 0$ and $s_0 = 1$:
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$
Example (bac, with p(a) = .2, p(b) = .5, p(c) = .3): after coding ba we have $l_2 = 0.2$ and $s_2 = 0.1$; then
$s_3 = 0.1 \cdot 0.3 = 0.03$
$l_3 = 0.2 + 0.1 \cdot (0.2 + 0.5) = 0.27$
The algorithm
Each symbol narrows the interval by a factor $p(T_i)$, so the final interval size is
$s_n = \prod_{i=1}^{n} p(T_i)$
with, again, $l_0 = 0$, $s_0 = 1$, $l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$ and $s_i = s_{i-1} \cdot p(T_i)$.
Sequence interval: $[l_n, l_n + s_n)$
To encode the sequence, take a number inside this interval.
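A float-based sketch of the recurrences above; real coders use incremental integer arithmetic, as noted earlier, so this is only illustrative:

def sequence_interval(text, p):
    # cum[s] = total probability of the symbols preceding s, in a fixed order
    cum, c = {}, 0.0
    for s in sorted(p):
        cum[s] = c
        c += p[s]
    l, size = 0.0, 1.0                # l_0 = 0, s_0 = 1
    for s in text:
        l += size * cum[s]            # l_i = l_{i-1} + s_{i-1} * cum[T_i]
        size *= p[s]                  # s_i = s_{i-1} * p(T_i)
    return l, size

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)                       # ~[0.27, 0.3): any number inside encodes bac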
Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2, .7) ⇒ b; subdivide [.2, .7) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
.49 ∈ [.3, .55) ⇒ b; subdivide [.3, .55) into a: [.3, .35), b: [.35, .475), c: [.475, .55)
.49 ∈ [.475, .55) ⇒ c
The message is bbc.
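A matching decoder sketch, assuming the message length n is known separately (as in the example):

def arithmetic_decode(x, n, p):
    out = []
    for _ in range(n):
        low = 0.0
        for s in sorted(p):            # locate the symbol interval containing x
            high = low + p[s]
            if low <= x < high:
                out.append(s)
                x = (x - low) / p[s]   # rescale x back to [0, 1) and repeat
                break
            low = high
    return "".join(out)

print(arithmetic_decode(0.49, 3, {"a": .2, "b": .5, "c": .3}))  # bbc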
How do we encode that number?
If $x = v/2^k$ (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front):
7/16 = .0111
3/4 = .11
1/3 = .010101… (not dyadic: its binary expansion is infinite and periodic)
How do we encode that number?
Binary fractional representation:
$x = .b_1 b_2 b_3 b_4 b_5 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + \ldots$
FractionalEncode(x):
1. x = 2 * x
2. if x < 1, output 0
3. else { output 1; x = x - 1; }
Example: 1/3 = .010101…
2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 > 1, output 1; 4/3 - 1 = 1/3 (and the loop repeats)
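The same loop in runnable form, emitting a fixed number of bits since expansions like 1/3 never terminate:

def fractional_encode(x, nbits):
    # first nbits of the binary expansion .b1 b2 b3 ... of x in [0, 1)
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(fractional_encode(1 / 3, 8))   # 01010101
print(fractional_encode(7 / 16, 4))  # 0111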
Incremental Generation
Which number do we encode?
Encode $x = l_n + s_n/2$, the midpoint of the sequence interval, truncated to its first $d = \lceil \log_2 (2/s_n) \rceil$ bits.
Truncation gets a smaller number… how much smaller?
Truncation ⇒ Compression
Dropping the bits of $x = .b_1 b_2 b_3 \ldots$ after position d decreases x by at most
$\sum_{i=d+1}^{\infty} 2^{-i} = 2^{-d} = 2^{-\lceil \log_2(2/s_n) \rceil} \le s_n/2$
so the truncated number still lies within $[l_n, l_n + s_n)$.
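Combining the two sketches above (sequence_interval and fractional_encode), the final emission step could look like this:

import math

def emit_code(l, s):
    # d = ceil(log2(2/s_n)) bits of the midpoint l_n + s_n/2 stay inside the interval
    d = math.ceil(math.log2(2 / s))
    return fractional_encode(l + s / 2, d)

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(emit_code(l, s))  # 7 bits, since ceil(log2(2/0.03)) = 7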
Bound on code length
Theorem: For a text T of length n, the arithmetic encoder generates at most
$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n) = 2 - \log_2 s_n$
$= 2 - \log_2 \left( \prod_{i=1..n} p(T_i) \right)$
$= 2 - \log_2 \left( \prod_{\sigma \in \Sigma} p(\sigma)^{occ(\sigma)} \right)$
$= 2 - \sum_{\sigma \in \Sigma} occ(\sigma) \cdot \log_2 p(\sigma)$
$\approx 2 + \sum_{\sigma \in \Sigma} (n \cdot p(\sigma)) \cdot \log_2 (1/p(\sigma)) = 2 + n H(T)$ bits
Example: T = acabc ⇒ $s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \cdot p(b) \cdot p(c)^2$
Document Compression
Dictionary-based compressors
LZ77
Algorithm's step:
Output <dist, len, next-char>
Advance by len + 1
A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting in the window.
[Figure: two steps of LZ77 on the text a a c a a c a b c a a a a a a c a c, emitting the triples <6,3,a> and <3,4,c>]
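A quadratic-time sketch of the parsing; real implementations bound the window and use hashing (see the gzip slide below), and <0,0,c> is the convention assumed here for characters with no previous match:

def lz77_compress(text):
    # emits triples <dist, len, next-char>, advancing by len + 1
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(i):                 # candidate match starting positions
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                     # matches may overrun the cursor (overlap)
            if l > best_len:
                best_len, best_dist = l, i - j
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("aacaacabcaaaaaa"))
# among its triples is (6, 3, 'a'), as in the figure above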
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the referenced substring and appends a copy of it.
What if l > d? (the copy overlaps the text still being produced)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy one character at a time starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor + i] = out[cursor - d + i];  /* left-to-right copy makes overlapping regions safe */
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: output one of the following formats:
(0, position, length) or (1, char)
Typically uses the second format if length < 3.
Special greedy parsing: possibly use a shorter match, so that the next match is better.
A hash table speeds up the search over triplets of characters; the emitted triples are then coded with Huffman's code.