• Non ci sono risultati.

Data Compression

N/A
N/A
Protected

Academic year: 2021

Condividi "Data Compression"

Copied!
15
0
0

Testo completo

(1)

Data Compression

Dictionary-based compressors

(2)

LZ77

Algorithm’s step:

Find the longest copy, that starts before

Output <dist, len, next-char>

Advance by len + 1

We could define buffer “window”, that moves

a a c a a c a b c a a a a a a Dictionary

(all substrings starting here)

<6,3,a>

<1,4,c>

a c

a a c a a c a b c a a a a a a ca c

(3)

a a c a a c a b c a b a a a c (3,4,b) a a c a a c a b c a b a a a c (1,1,c) a a c a a c a b c a b a a a c

Example: LZ77 with window

a a c a a c a b c a b a a a c (0,0,a)

a a c a a c a b c a b a a a c Window size = 6

Longest match Next character a a c a a c a b c a b a a a c

c

a a c a a c a b c a b a a a c

a a c a a c a b a b a a a c (3,3,a) a a c a a c a b c a b a a a c (1,2,c)

within W

(4)

LZ77 Decoding

Decoder keeps same dictionary window as encoder.

Finds substring and inserts a copy of it

What if len > dist? (overlap with text to be compressed)

E.g. seen = abcd, next codeword is (2,9,e)

Output is correct:

abcd

abcdcd

abcdcdcd

abcdcdcdcdcdce

for (i = 0; i < len; i++)

out[cursor+i] = out[cursor-d+i]

(5)

LZ77 Optimizations used by gzip

LZSS: Output one of the following formats

(0, position, length) or (1,char)

Typically uses the second format if length <

3.

a a c a a c a b c a b a a a c (1,a) a a c a a c a b c a b a a a c (1,a)

a a c a a c a b c a b a a a c (0,3,4) a a c a a c a b c a b a a a c (1,c)

(6)

LZ77 Optimizations used by gzip

Hash Table for speed-up searches on triplets

Keep in a HT all triplets of chars in the Window

Search the current triplet in HT

Scan all matchets and try to extend them

Triplets: <aac:1,4><aca:2,5><caa:3,9>…

Every shift needs to update the HT

a a c a a c a b c a a a a a c ca a

(7)

LZ78

Dictionary:

substrings stored in a trie (each has an id).

Coding loop:

find the longest match S in the dictionary

Output its id and the next character c after the match in the input string

Add the substring Sc to the dictionary

Decoding:

builds the same dictionary and looks at ids

Possibly better for cache effects

(8)

LZ78: Coding Example

a a b a a c a b c a b c b (0,a) 1 = a Dict.

Output

a a b a a c a b c a b c b (1,b) 2 = ab a a b a a c a b c a b c b (1,a) 3 = aa a a b a a c a b c a b c b (0,c) 4 = c a a b a a c a b c a b c b (2,c) 5 = abc a a b a a c a b c a b c b (5,b) 6 = abcb

a b a a c a b c a b c b (0,empty) a

Longest match Next character

(9)

LZ78: Decoding Example

(0,a) a 1 = a

a a b

(1,b) 2 = ab

a a b a a

(1,a) 3 = aa

a a b a a c

(0,c) 4 = c

a a b a a c a b c

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

Input Dict.

(10)

LZW (Lempel-Ziv-Welch) [‘84]

Don’t send extra character c, but still add Sc to the dictionary.

The dictionary is initialized with byte values being the first 256 entries (e.g. a = 112,

ascii), otherwise there is no way to start it up.

The decoder is one step behind the coder since it does not know c

There is an issue for strings of the form

SSc where c= S[0], and these are handled specially!!!

(11)

LZW: Encoding Example

a a b a a c a b c a b c b 112 256=aa Dict.

Output

a a b a a c a b c a b c b 257=ab a a b a a c a b c a b c b 113 258=ba a a b a a c a b c a b c b 256 259=aac a a b a a c a b c a b c b 114 260=ca a a b a a c a b c a b c b 257 261=abc

112

a a b a a c a b c a b c b 260 262=cab

Longest match Next character (included in dict-phrase, but not in compressed-phrase)

(12)

LZW: Decoding Example

112 a

256=aa a a

257=ab a a b

113

258=ba a a b a a

256

259=aac a a b a a c

114

260=ca a a b a a c a b

257

261=ab?

112

a a b a a c a b ? 261

Input Dict

one step later

a a b a a c a b a b ? 261=aba a a b a a c a b a b a

(13)

LZ78 and LZW issues

How do we keep the dictionary small?

Throw the dictionary away when it reaches a certain size (used in GIF)

Throw the dictionary away when it is no longer effective at compressing (

e.g.

compress

)

….

(14)

Lempel-Ziv Algorithms

Keep a “dictionary” of recently-seen strings.

The differences are:

How the dictionary is stored

How it is extended

How it is indexed

How elements are removed

How phrases are encoded

LZ-algos are asymptotically optimal, i.e. their

compression ratio goes to Entropy(T) for n

!!

No explicit

frequency estimation

(15)

You find this at: www.gzip.org/zlib/

Riferimenti

Documenti correlati

Since the late 1960s, grassroots movements, associations, committees, political groups, antimilitarists and ecologists have struggled against land occupation by NATO and

 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)..

 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)..

Use the 2-gram index to rank all lexicon terms according to the number of matching 2-grams (moon).

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over

 The search in the hash table of D2 now checks the strong- hash match to return the original strings in D1 (that need a post-filter step).. This induces further False Positives, but

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their creator with certain inalienable rights, that among these are life, liberty,

A phase 1 study of lenvatinib, multiple receptor tyrosine kinase inhibitor, in Japanese patients with advanced solid tumors.. [38] Cabanillas ME, Schlumberger M, Jarzab B,