Index construction:
Compression of documents
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding
Huffman codes
Arithmetic codes
Dictionary coding
LZ77, LZ78, LZSS,…
Gzip, zippy, snappy,…
Text transforms
Burrows-Wheeler Transform
bzip
Basics of Data Compression
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol.
e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It can be parsed both as 101·1 = ca and as 1·011 = ad.
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie:
[Figure: binary trie with 0/1-labeled edges and leaves a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$
Example: p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords starting with 1)
La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)
We say that a prefix code C is optimal if, for all prefix codes C', $L_a(C) \le L_a(C')$.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is:
$i(s) = \log_2 \frac{1}{p(s)}$ bits
Lower probability ⇒ higher information. Entropy is the weighted average of i(s):
$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$
0-th order empirical entropy of string T:
$H_0(T) = \sum_{s} \frac{occ(s,T)}{|T|} \log_2 \frac{|T|}{occ(s,T)}$
Note that $0 \le H \le \log_2 |\Sigma|$:
H → 0 for a skewed distribution; H is maximum for a uniform distribution.
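A minimal sketch of the 0-th order empirical entropy formula above, assuming Python; the function name and Counter-based tally are illustrative choices:

import math
from collections import Counter

def empirical_entropy_h0(text):
    # H0(T) = sum over symbols s of (occ(s)/|T|) * log2(|T|/occ(s))
    n = len(text)
    return sum((occ / n) * math.log2(n / occ) for occ in Counter(text).values())

print(empirical_entropy_h0("aaaaaaaaab"))  # skewed: ~0.47 bits per symbol
print(empirical_entropy_h0("abcd"))        # uniform: 2.0 = log2 |alphabet|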
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to compression ratio.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1
H ≈ 1.36 bits; Huffman ≈ 1.5 bits per symbol
Shannon (entropy vs avg codeword length): $H(S)$ vs $\sum_{s} p(s) \cdot |c(s)|$
In practice (empirical entropy vs compression ratio): $|T| \cdot H_0(T)$ vs $|C(T)|$
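As a worked check of the two figures above:

$H = 0.7 \log_2 \tfrac{1}{0.7} + 3 \cdot 0.1 \log_2 \tfrac{1}{0.1} \approx 0.7 \cdot 0.515 + 0.3 \cdot 3.322 \approx 1.36$ bits
$L_a = 0.7 \cdot 1 + 0.1 \cdot 2 + 2 \cdot (0.1 \cdot 3) = 1.5$ bits, e.g. with codewords A = 0, B = 10, C = 110, D = 111.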
An optimal code is surely one that…
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the early '50s.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression, …
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, La(Huff) < H + 1: less than 1 extra bit per symbol on average!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
a = 000, b = 001, c = 01, d = 1
[Figure: Huffman tree with leaves a(.1), b(.2), c(.2), d(.5), internal nodes (.3), (.5), (1), and 0/1-labeled edges]
There are 2^(n-1) "equivalent" Huffman trees, obtained by swapping the 0/1 labels of the internal nodes.
What about ties (and thus, tree depth)?
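A minimal sketch of the greedy construction, assuming Python; heapq and the tie-breaking counter are implementation choices, and because of ties the exact bits may differ from the slide (only the codeword lengths are guaranteed):

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword
    uid = count()  # breaks ties between equal-weight subtrees
    heap = [(p, next(uid), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # merge the two least-probable subtrees
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(uid), (t1, t2)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: label branches 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths match the example: a, b -> 3 bits, c -> 2, d -> 1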
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
[Figure: the Huffman tree above, with leaves a(.1), b(.2), c(.2), d(.5) and 0/1-labeled edges]
Example: abc… encodes to 00000101…; the bit stream 101001… decodes to dcb…
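A sketch of both directions with the codeword table above; the dict-based trie walk is an illustrative choice:

def encode(text, codes):
    return "".join(codes[s] for s in text)

def decode(bits, codes):
    inv = {cw: s for s, cw in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b                 # extend the root-to-node path by one branch
        if cur in inv:           # reached a leaf: emit symbol, return to root
            out.append(inv[cur])
            cur = ""
    return "".join(out)

codes = {"a": "000", "b": "001", "c": "01", "d": "1"}
print(encode("abc", codes))      # 00000101
print(decode("101001", codes))   # dcb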
Huffman in practice
The compressed file of n symbols consists of:
Preamble: tree encoding + symbols in leaves
Body: compressed text of n symbols
Preamble = Θ(|Σ| log |Σ|) bits
Body is at least nH bits and at most nH + n bits
The extra +n is bad for very skewed distributions, namely ones for which H → 0.
Example: p(a) = 1/n, p(b) = (n-1)/n
There are better choices:
T = aaaaaaaaab
Huffman = {a,b}-encoding + 10 bits
RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding
= 0001001 1 + {a,b}-encoding
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map symbol → bits uniquely as Huffman does, the mapping may change along the text, and it may spend fractions of a bit per symbol.
Fax, bzip, … use RLE.
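The run-length bit counts in the example match Elias γ codes (⌊log2 v⌋ zeros, then v in binary); the slide does not name the code, so treating the run lengths as γ-coded is an assumption:

def elias_gamma(v):
    # assumption: run lengths v >= 1 coded as floor(log2 v) zeros, then v in binary
    b = bin(v)[2:]
    return "0" * (len(b) - 1) + b

print(elias_gamma(9), elias_gamma(1))  # 0001001 1 -> 8 bits, vs Huffman's 10 bits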
Idea on Huffman?
Goal: reduce the impact of the +1 bit
Solution:
Divide the text into blocks of k symbols
The +1 is spread over k symbols
So the loss is 1/k per symbol
Caution: the alphabet becomes Σ^k, so the tree gets larger, and so does the preamble. At the limit, the preamble stores one k-gram = the whole input text, and the compressed body is 1 bit only.
This means no compression at all!
Document Compression
Arithmetic coding
Introduction
It uses "fractional" parts of bits!!
Gets < nH(T) + 2 bits vs. < nH(T) + n (Huffman)
Used in JPEG/MPEG (as option), bzip
More time-costly than Huffman, but an integer implementation is not too bad.
Nearly ideal performance: in practice, the overhead is about 0.02 · n bits.
Symbol interval
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7
The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence: bac (p(a) = .2, p(b) = .5, p(c) = .3)
Start with [0, 1), split as a: [0, .2), b: [.2, .7), c: [.7, 1).
b ⇒ [0.2, 0.7); split it by sizes (0.7 - 0.2) · p(s) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
a ⇒ [0.2, 0.3); split it by sizes (0.3 - 0.2) · p(s) into a: [.2, .22), b: [.22, .27), c: [.27, .3)
c ⇒ [0.27, 0.3)
The final sequence interval is [0.27, 0.3).
The algorithm
To code a sequence of symbols T_1 … T_n with probabilities p(T_i), use the following recurrences, starting from $l_0 = 0$ and $s_0 = 1$:
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$
Example (bac, with p(a) = .2, p(b) = .5, p(c) = .3): after coding ba we have $l_2 = 0.2$ and $s_2 = 0.1$; then
$s_3 = 0.1 \cdot 0.3 = 0.03$
$l_3 = 0.2 + 0.1 \cdot (0.2 + 0.5) = 0.27$
The algorithm
Each symbol narrows the interval by a factor $p(T_i)$, so the final interval size is
$s_n = \prod_{i=1}^{n} p(T_i)$
with, again, $l_0 = 0$, $s_0 = 1$, $l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$ and $s_i = s_{i-1} \cdot p(T_i)$.
Sequence interval: $[l_n, l_n + s_n)$
To encode the sequence, take a number inside this interval.
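A float-based sketch of the recurrences above; real coders use incremental integer arithmetic, as noted earlier, so this is only illustrative:

def sequence_interval(text, p):
    # cum[s] = total probability of the symbols preceding s, in a fixed order
    cum, c = {}, 0.0
    for s in sorted(p):
        cum[s] = c
        c += p[s]
    l, size = 0.0, 1.0                # l_0 = 0, s_0 = 1
    for s in text:
        l += size * cum[s]            # l_i = l_{i-1} + s_{i-1} * cum[T_i]
        size *= p[s]                  # s_i = s_{i-1} * p(T_i)
    return l, size

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)                       # ~[0.27, 0.3): any number inside encodes bac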
Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2, .7) ⇒ b; subdivide [.2, .7) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
.49 ∈ [.3, .55) ⇒ b; subdivide [.3, .55) into a: [.3, .35), b: [.35, .475), c: [.475, .55)
.49 ∈ [.475, .55) ⇒ c
The message is bbc.
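A matching decoder sketch, assuming the message length n is known separately (as in the example):

def arithmetic_decode(x, n, p):
    out = []
    for _ in range(n):
        low = 0.0
        for s in sorted(p):            # locate the symbol interval containing x
            high = low + p[s]
            if low <= x < high:
                out.append(s)
                x = (x - low) / p[s]   # rescale x back to [0, 1) and repeat
                break
            low = high
    return "".join(out)

print(arithmetic_decode(0.49, 3, {"a": .2, "b": .5, "c": .3}))  # bbc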
How do we encode that number?
If $x = v/2^k$ (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front):
7/16 = .0111
3/4 = .11
1/3 = .010101… (not dyadic: its binary expansion is infinite and periodic)
How do we encode that number?
Binary fractional representation:
$x = .b_1 b_2 b_3 b_4 b_5 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + \ldots$
FractionalEncode(x):
1. x = 2 * x
2. if x < 1, output 0
3. else { output 1; x = x - 1; }
Example: 1/3 = .010101…
2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 > 1, output 1; 4/3 - 1 = 1/3 (and the loop repeats)
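The same loop in runnable form, emitting a fixed number of bits since expansions like 1/3 never terminate:

def fractional_encode(x, nbits):
    # first nbits of the binary expansion .b1 b2 b3 ... of x in [0, 1)
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(fractional_encode(1 / 3, 8))   # 01010101
print(fractional_encode(7 / 16, 4))  # 0111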
Incremental Generation
Which number do we encode?
Encode $x = l_n + s_n/2$, the midpoint of the sequence interval, truncated to its first $d = \lceil \log_2 (2/s_n) \rceil$ bits.
Truncation gets a smaller number… how much smaller?
Truncation ⇒ Compression
Dropping the bits of $x = .b_1 b_2 b_3 \ldots$ after position d decreases x by at most
$\sum_{i=d+1}^{\infty} 2^{-i} = 2^{-d} = 2^{-\lceil \log_2(2/s_n) \rceil} \le s_n/2$
so the truncated number still lies within $[l_n, l_n + s_n)$.
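Combining the two sketches above (sequence_interval and fractional_encode), the final emission step could look like this:

import math

def emit_code(l, s):
    # d = ceil(log2(2/s_n)) bits of the midpoint l_n + s_n/2 stay inside the interval
    d = math.ceil(math.log2(2 / s))
    return fractional_encode(l + s / 2, d)

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(emit_code(l, s))  # 7 bits, since ceil(log2(2/0.03)) = 7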
Bound on code length
Theorem: For a text T of length n, the arithmetic encoder generates at most
$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n) = 2 - \log_2 s_n$
$= 2 - \log_2 \left( \prod_{i=1..n} p(T_i) \right)$
$= 2 - \log_2 \left( \prod_{\sigma \in \Sigma} p(\sigma)^{occ(\sigma)} \right)$
$= 2 - \sum_{\sigma \in \Sigma} occ(\sigma) \cdot \log_2 p(\sigma)$
$\approx 2 + \sum_{\sigma \in \Sigma} (n \cdot p(\sigma)) \cdot \log_2 (1/p(\sigma)) = 2 + n H(T)$ bits
Example: T = acabc ⇒ $s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \cdot p(b) \cdot p(c)^2$
Document Compression
Dictionary-based compressors
LZ77
Algorithm's step:
Output <dist, len, next-char>
Advance by len + 1
A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting in the window.
[Figure: two steps of LZ77 on the text a a c a a c a b c a a a a a a c a c, emitting the triples <6,3,a> and <3,4,c>]
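A quadratic-time sketch of the parsing; real implementations bound the window and use hashing (see the gzip slide below), and <0,0,c> is the convention assumed here for characters with no previous match:

def lz77_compress(text):
    # emits triples <dist, len, next-char>, advancing by len + 1
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(i):                 # candidate match starting positions
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                     # matches may overrun the cursor (overlap)
            if l > best_len:
                best_len, best_dist = l, i - j
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("aacaacabcaaaaaa"))
# among its triples is (6, 3, 'a'), as in the figure above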
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the referenced substring and appends a copy of it.
What if l > d? (the copy overlaps the text still being produced)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy one character at a time starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor + i] = out[cursor - d + i];  /* left-to-right copy makes overlapping regions safe */
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: output one of the following formats:
(0, position, length) or (1, char)
Typically uses the second format if length < 3.
Special greedy parsing: possibly use a shorter match, so that the next match is better.
A hash table speeds up the search over triplets of characters; the emitted triples are then coded with Huffman's code.