Basics of Data Compression

Academic year: 2021

(1)

Basics of Data Compression

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(2)

Uniquely Decodable Codes

A variable-length code assigns a bit string (codeword) of variable length to every symbol.

e.g. a = 1, b = 01, c = 101, d = 011

What if you get the sequence 1011? It could be "aba" (1·01·1) or "ca" (101·1): ambiguous.

A uniquely decodable code is one whose encoded sequences can always be uniquely decomposed into codewords.

(3)

Prefix Codes

A prefix code is a variable-length code in which no codeword is a prefix of another one.

e.g. a = 0, b = 100, c = 101, d = 11

It can be viewed as a binary trie: each codeword is a root-to-leaf path (0 = left, 1 = right), so a sits at depth 1, d at depth 2, and b, c at depth 3.

(4)

Average Length

For a code C over alphabet S with codeword lengths L[s], the average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

e.g. p(A) = .7 with codeword 0 (1 bit), p(B) = p(C) = p(D) = .1 with 3-bit codewords:

La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

(5)

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self-information of s is:

i(s) = log2 (1 / p(s)) bits

Lower probability → higher information. Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log2 (1 / p(s))

The 0-th order empirical entropy of a string T replaces p(s) with the empirical frequency occ(s)/|T|:

H0(T) = Σ_s (occ(s) / |T|) · log2 (|T| / occ(s))

0 ≤ H ≤ log2 |Σ|

H → 0 for a skewed distribution; H is maximum for the uniform distribution.
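The empirical entropy above can be computed directly from symbol counts; a minimal sketch (the function name is mine):

```python
from collections import Counter
from math import log2

def empirical_entropy(T):
    """0-th order empirical entropy H0(T), in bits per symbol."""
    n = len(T)
    counts = Counter(T)
    return sum(occ / n * log2(n / occ) for occ in counts.values())
```

For example, a string of 10 symbols with 7 A's and one each of B, C, D matches the distribution p(A) = .7, p(B) = p(C) = p(D) = .1 used on the next slide, and gives H0 ≈ 1.36 bits.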

(6)

Performance: Compression ratio

Compression ratio = #bits in output / #bits in input

Compression performance: we relate entropy to the compression ratio.

Shannon: the average codeword length is La(C) = Σ_s p(s) · |c(s)| ≥ H(S)

In practice we compare the empirical entropy with the compressed size: |T| · H0(T) vs. |C(T)|

e.g. p(A) = .7, p(B) = p(C) = p(D) = .1

H ≈ 1.36 bits; Huffman ≈ 1.5 bits per symbol

An optimal code is surely one that minimizes the average codeword length La.

(7)

Index construction:

Compression of postings

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 5.3 and a paper

(8)

γ code for integer encoding

For x > 0, let Length = ⌊log2 x⌋ + 1 (the number of bits of x). The γ code writes Length − 1 zeros, followed by x in binary:

0000...0 x-in-binary
(Length − 1 zeros)

e.g., 9 is represented as <000,1001>.

The γ code for x takes 2⌊log2 x⌋ + 1 bits

(i.e. a factor of 2 from the optimal log2 x)

Optimal for Pr(x) = 1/(2x²) and i.i.d. integers

(9)

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8 6 3 59 7
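The exercise can be checked mechanically. A sketch of the γ codec under the slide's convention (Length − 1 zeros, then x in binary); the function names are mine:

```python
def gamma_encode(x):
    """Gamma code: (Length - 1) zeros, then x in binary (x > 0)."""
    assert x > 0
    b = bin(x)[2:]                  # binary of x; Length = len(b)
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":       # count leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))   # next z+1 bits are x
        i += z + 1
    return out
```

Decoding the bit string from the slide indeed yields 8, 6, 3, 59, 7.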

(10)

δ code for integer encoding

Use γ-coding to reduce the length of the first field: encode γ(Length), then x in binary.

Useful for medium-sized integers.

e.g., 19 is represented as <00,101,10011> (γ(5) = 00101, then 10011).

δ coding of x takes about log2 x + 2 log2(log2 x) + 2 bits.

Optimal for Pr(x) = 1/(2x (log x)²) and i.i.d. integers
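A sketch of the δ encoder in the slide's variant (which keeps the full binary representation of x, leading 1 included, after the γ-coded length); function names are mine:

```python
def gamma_encode(x):
    """Gamma code: (Length - 1) zeros, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x):
    """Delta code, slide's variant: gamma(Length) followed by x in binary."""
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b
```

For x = 19, binary is 10011 (Length = 5), so the output is γ(5) · 10011 = 00101 10011, matching the slide.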

(11)

Variable-bytecodes

[10.2 bits per TREC12]

Wish to get very fast (de)compression → byte-aligned codes.

Given the binary representation of an integer:

Prepend 0s to get a multiple-of-7 number of bits

Form groups of 7 bits each

Append to the last group the tag bit 0, and to the other groups the tag bit 1

e.g., v = 2^14 + 1 → binary(v) = 100000000000001 → 10000001 10000000 00000001

Note: we waste 1 bit per byte, and on average 4 bits in the first byte.

But it is a prefix code, and it also encodes the value 0!
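The steps above can be sketched directly (the function name is mine; bytes are shown as bit strings for readability):

```python
def vbyte_encode(v):
    """Variable-byte code: 7-bit groups, tag 1 on all bytes but the last."""
    bits = bin(v)[2:]
    bits = "0" * ((-len(bits)) % 7) + bits          # pad to a multiple of 7
    groups = [bits[i:i + 7] for i in range(0, len(bits), 7)]
    # continuation tag: 1 on every group except the final one
    return ["1" + g for g in groups[:-1]] + ["0" + groups[-1]]
```

For v = 2^14 + 1 this yields 10000001 10000000 00000001 as on the slide, and v = 0 is representable as the single byte 00000000.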

(12)

PForDelta coding

Example: a block of 128 numbers encoded with b = 2 bits each = 256 bits = 32 bytes; values that do not fit in b bits (e.g. 23 and 42) become exceptions.

Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions

Encode the exceptions separately: ESC values or pointers

Choose b to encode 90% of the values, or trade off:

larger b → more wasted bits; smaller b → more exceptions

Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1]
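A minimal sketch of the slot/exception split, assuming the simplest variant: real PForDelta packs the slots into b-bit fields and chains exceptions via in-slot pointers, whereas this sketch just uses a sentinel marker (function names are mine):

```python
def pfor_encode(values, b):
    """Split a block into b-bit slots plus an exception list (sketch).

    A real PForDelta codec would pack `slots` into a bit array and store,
    inside each exception's slot, a pointer to the next exception; here a
    None sentinel stands in for the ESC/pointer mechanism.
    """
    limit = 1 << b
    slots, exceptions = [], []
    for v in values:
        if v < limit:
            slots.append(v)
        else:
            slots.append(None)      # exception marker
            exceptions.append(v)
    return slots, exceptions

def pfor_decode(slots, exceptions):
    """Rebuild the block by patching exceptions back into their slots."""
    it = iter(exceptions)
    return [next(it) if s is None else s for s in slots]
```

With b = 2, the values 23 and 42 from the example above land in the exception list while everything below 4 fits in a slot.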

(13)

Index construction:

Compression of documents

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79

(14)

Raw docs are needed

(15)

Various Approaches

Statistical coding

Huffman codes

Arithmetic codes

Dictionary coding

LZ77, LZ78, LZSS,…

Gzip, zippy, snappy,…

Text transforms

Burrows-Wheeler Transform

bzip

(16)

Document Compression

Huffman coding

(17)

Huffman Codes

Invented by Huffman as a class assignment in the early '50s.

Used in most compression algorithms

gzip, bzip, jpeg (as an option), fax compression,…

Properties:

Generates optimal prefix codes

Cheap to encode and decode

La(Huff) = H if the probabilities are powers of 2

Otherwise, La(Huff) < H + 1: less than 1 extra bit per symbol on average!

(18)

Running Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Repeatedly merge the two least-probable nodes: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1)

Resulting codewords: a = 000, b = 001, c = 01, d = 1

There are 2^(n−1) “equivalent” Huffman trees (flip the 0/1 labels at each internal node)

What about ties (and thus, tree depth)?
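The greedy merging above can be sketched with a priority queue; the function name and dict-of-depths representation are mine:

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Codeword lengths via Huffman's algorithm: merge the two smallest."""
    tiebreak = count()      # keeps heap comparisons away from the dicts
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # merging pushes every contained leaf one level deeper
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}
lengths = huffman_code_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)   # average codeword length La
```

On the running example this gives lengths 3, 3, 2, 1 and La = 1.8 bits, against H ≈ 1.76 bits.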

(19)

Encoding and Decoding

Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.

On the running example tree (a = 000, b = 001, c = 01, d = 1):

Encoding: abc… → 00000101…

Decoding: 101001… → d c b…

(20)

Huffman in practice

The compressed file of n symbols, consists of:

Preamble: tree encoding + symbols in leaves

Body: compressed text of n symbols

Preamble = Θ(|Σ| log |Σ|) bits

Body is at least nH and at most nH + n bits

The extra +n is bad for very skewed distributions, namely ones for which H → 0

Example: p(a) = 1/n, p(b) = (n−1)/n

(21)

There are better choices

T = aaaaaaaaab

Huffman = {a,b}-encoding + 10 bits

RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding

So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map symbol → bits uniquely as Huffman does; the mapping may change along the text and, moreover, it uses fractions of bits.

(22)

Idea on Huffman?

Goal: Reduce the impact of the +1 bit

Solution:

Divide the text into blocks of k symbols

The +1 is spread over k symbols

So the loss is 1/k per symbol

Caution: the alphabet becomes Σ^k, so the preamble gets larger.

At the limit, the preamble is 1 block equal to the whole input text, and the compressed text is 1 bit only.

No compression!

(23)

Document Compression

Arithmetic coding

(24)

Introduction

It uses “fractional” parts of bits!!

Gets nH(T) + 2 bits vs. nH(T) + n of Huffman

Used in JPEG/MPEG (as an option) and bzip

More time-costly than Huffman, but an integer implementation is not too bad.

Ideal performance; in practice the extra cost is about 0.02 · n bits.

(25)

Symbol interval

Assign each symbol an interval ranging from 0 (inclusive) to 1 (exclusive), of width equal to its probability.

e.g. p(a) = .2, p(b) = .5, p(c) = .3

cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7

The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7))

(26)

Sequence interval

Coding the message sequence bac, with p(a) = .2, p(b) = .5, p(c) = .3:

Start with [0, 1), split into a [0, .2), b [.2, .7), c [.7, 1).

b → restrict to [.2, .7); split it into a [.2, .3) (width (0.7−0.2)·0.2 = 0.1), b [.3, .55) (width 0.25), c [.55, .7) (width 0.15)

a → restrict to [.2, .3); split it into a [.2, .22) (width (0.3−0.2)·0.2 = 0.02), b [.22, .27) (width 0.05), c [.27, .3) (width 0.03)

c → restrict to [.27, .3)

The final sequence interval is [.27, .3)

(27)

The algorithm

To code a sequence T = T1 … Tn of symbols with probabilities p(s), maintain an interval [l_i, l_i + s_i):

s_0 = 1, l_0 = 0

s_i = s_{i−1} * p(T_i)

l_i = l_{i−1} + s_{i−1} * cum[T_i]

Example (bac, with p(a) = .2, p(b) = .5, p(c) = .3):

l_1 = 0 + 1 * 0.2 = 0.2, s_1 = 1 * 0.5 = 0.5

l_2 = 0.2 + 0.5 * 0 = 0.2, s_2 = 0.5 * 0.2 = 0.1

l_3 = 0.2 + 0.1 * (0.2 + 0.5) = 0.27, s_3 = 0.1 * 0.3 = 0.03
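The recurrence can be run with exact rationals to avoid float drift; a sketch (names P, CUM, and the function name are mine):

```python
from fractions import Fraction as F

# model for the slide's example: symbol -> probability / cumulative probability
P   = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
CUM = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}

def sequence_interval(msg):
    """Return (l_n, s_n): the sequence interval [l_n, l_n + s_n)."""
    l, s = F(0), F(1)
    for sym in msg:
        l = l + s * CUM[sym]    # shift by the cumulative probability
        s = s * P[sym]          # shrink by the symbol's probability
    return l, s
```

For the message bac this returns l = 0.27 and s = 0.03, the interval [.27, .3) from the previous slide.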

(28)

The algorithm

Each symbol of the message narrows the interval by a factor p(T_i):

s_0 = 1, l_0 = 0

s_i = s_{i−1} * p(T_i)

l_i = l_{i−1} + s_{i−1} * cum[T_i]

Final interval size:

s_n = ∏_{i=1..n} p(T_i)

Sequence interval: [l_n, l_n + s_n)

Take a number inside it

(29)

Decoding Example

Decoding the number .49, knowing the message has length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

[0, 1) splits into a [0, .2), b [.2, .7), c [.7, 1): .49 ∈ b

[.2, .7) splits into a [.2, .3), b [.3, .55), c [.55, .7): .49 ∈ b

[.3, .55) splits into a [.3, .35), b [.35, .475), c [.475, .55): .49 ∈ c

The message is bbc.
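The decoder can equivalently rescale the number into [0, 1) at each step instead of subdividing the current interval; a sketch with exact rationals (the function name is mine):

```python
from fractions import Fraction as F

P = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}   # insertion order a, b, c

def arithmetic_decode(x, n):
    """Recover n symbols from a number x inside the sequence interval."""
    out = []
    for _ in range(n):
        low = F(0)
        for sym, p in P.items():        # find the symbol interval containing x
            if low <= x < low + p:
                out.append(sym)
                x = (x - low) / p       # rescale x into [0, 1) and repeat
                break
            low += p
    return "".join(out)
```

Decoding x = 0.49 with n = 3 reproduces the slide's answer, bbc.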

(30)

How do we encode that number?

If x = v/2^k (a dyadic fraction) then the encoding is bin(v) over k digits (possibly padded with 0s in front):

7/16 = .0111

3/4 = .11

1/3 = .010101… (not dyadic: infinite expansion)

(31)

How do we encode that number?

Binary fractional representation: x = .b1 b2 b3 b4 b5 … means x = b1/2 + b2/4 + b3/8 + b4/16 + …

FractionalEncode(x):
1. x = 2 * x
2. If x < 1, output 0
3. else { output 1; x = x − 1 }
…and repeat.

Example, 1/3 = .010101…:

2 * (1/3) = 2/3 < 1, output 0

2 * (2/3) = 4/3 ≥ 1, output 1, and x = 4/3 − 1 = 1/3

Incremental generation: the bits are produced one at a time, left to right.
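FractionalEncode translates directly to code; a sketch using exact rationals so that 1/3 cycles exactly (the function name is mine):

```python
from fractions import Fraction

def fractional_bits(x, d):
    """First d bits of the binary fractional expansion of x in [0, 1)."""
    bits = []
    for _ in range(d):
        x *= 2              # step 1: double
        if x < 1:
            bits.append("0")    # step 2: integer part is 0
        else:
            bits.append("1")    # step 3: integer part is 1; drop it
            x -= 1
    return "".join(bits)
```

For x = 1/3 the bits alternate 0, 1 forever; for a dyadic fraction like 7/16 the expansion terminates.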

(32)

Which number do we encode?

Encode the midpoint x = l_n + s_n/2, truncating its encoding to the first d = ⌈log2 (2/s_n)⌉ bits.

Truncation gets a smaller number… how much smaller?

The dropped tail is worth at most Σ_{i>d} b_i 2^{−i} ≤ Σ_{i>d} 2^{−i} = 2^{−d} ≤ s_n/2

so the truncated number is still ≥ l_n, hence still inside [l_n, l_n + s_n).

Truncation → Compression
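The truncation step can be checked numerically on the bac example (variable names are mine; the interval values come from the earlier slides):

```python
from fractions import Fraction
from math import ceil, log2

l, s = Fraction(27, 100), Fraction(3, 100)   # sequence interval of "bac"
d = ceil(log2(2 / s))                        # number of bits to emit
x = l + s / 2                                # midpoint of the interval
truncated = Fraction(int(x * 2**d), 2**d)    # keep only the first d bits
```

Here d = 7, the midpoint 0.285 truncates to 36/128 = 0.28125, and that value still lies inside [0.27, 0.30), so the decoder recovers the same message.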

(33)

Bound on code length

Theorem: For a text T of length n, the arithmetic encoder generates at most

⌈log2 (2/s_n)⌉ < 1 + log2 (2/s_n) = 1 + (1 − log2 s_n)

= 2 − log2 ( ∏_{i=1..n} p(T_i) )

= 2 − log2 ( ∏_σ p(σ)^occ(σ) )

= 2 − Σ_σ occ(σ) * log2 p(σ)

≈ 2 + Σ_σ ( n * p(σ) ) * log2 (1/p(σ)) = 2 + n H(T) bits

Example: T = acabc → s_n = p(a) * p(c) * p(a) * p(b) * p(c) = p(a)² * p(b) * p(c)²

(34)

Document Compression

Dictionary-based compressors

(35)

LZ77

Algorithm’s step:

Output <dist, len, next-char>

Advance by len + 1

A buffer “window” has fixed length and moves along the text; the dictionary consists of all substrings starting inside the window.

Example text: a a c a a c a b c a a a a a a — among the emitted triples are <6,3,a> and <3,4,c>.

(36)

LZ77 Decoding

The decoder keeps the same dictionary window as the encoder.

It finds the referenced substring and inserts a copy of it.

What if len > dist? (the copy overlaps the text still being produced)

E.g. seen = abcd, next codeword is (2,9,e)

Simply copy one character at a time, starting at the cursor:

for (i = 0; i < len; i++)
  out[cursor + i] = out[cursor - dist + i]

Output is correct: abcdcdcdcdcdce
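The overlapping copy from the slide can be sketched as a single decoder step (the function name is mine); copying one character at a time is exactly what makes the len > dist case work, since each copied character becomes available for the next iteration:

```python
def lz77_decode_step(out, dist, length, nxt):
    """Apply one <dist, len, next-char> triple to the output buffer."""
    cursor = len(out)
    for i in range(length):
        # may read characters appended earlier in this same loop (overlap)
        out.append(out[cursor - dist + i])
    out.append(nxt)
```

Starting from the seen text abcd, the triple (2, 9, e) expands to abcdcdcdcdcdce, as on the slide.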

(37)

LZ77 Optimizations used by gzip

LZSS: output one of the following formats

(0, position, length) or (1, char)

Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so that the next match is better.

A hash table speeds up the search for matches; the triples are then coded with Huffman’s code.

(38)

You find this at: www.gzip.org/zlib/

(39)

Google’s solution
