
UNIVERSITÀ DI PISA
Dipartimento di Informatica

PH.D. THESIS

COMPRESSION TECHNIQUES FOR LARGE GRAPHS: THEORY AND PRACTICE

Luca Versari

SUPERVISOR
Roberto Grossi, Università di Pisa

REFEREES
Travis Gagie, Dalhousie University
Daniel Lemire, Université TÉLUQ
Kijung Shin

ABSTRACT

In today's world, compression is a fundamental technique that lets our computers deal efficiently with the massive amount of available information. Graphs, and in particular web and social graphs, have been growing exponentially larger in recent years, increasing the need for efficient compressed representations of them. In this thesis, we study the compression of graphs from both a theoretical and a practical point of view. We provide a new technique to achieve better compression of sequences of large integers, showing its theoretical and practical properties, as well as new techniques for practical context modeling in large context spaces. We conduct a theoretical analysis of the compression of various models of graphs, showing that theoretically optimal compression for each of those models can be achieved in polynomial time. Finally, we show how to apply the proposed techniques to practical graph compression to obtain a new scheme that achieves significant size savings over the state of the art, while still allowing efficient compression and decompression algorithms.

ACKNOWLEDGEMENTS

During the years of my PhD, I worked with many fantastic people, both at the University of Pisa and at Google, and I enjoyed the support of my family and many other people whom I did not work with. Thanks to everyone for their friendship, the interesting discussions, and the fun moments! Our interactions helped make my life a bit better every day.

Among those people, special thanks go to those who helped me improve this thesis; in no particular order, Alessio, Roberto, Thomas, Jyrki, Iulia, Robert, Jon.

CONTENTS

1 Introduction
  1.1 First part: general-purpose compression
  1.2 Second part: graph compression
  1.3 Notation and definitions
  1.4 Published material

I General-purpose compression

2 Compression concepts and techniques
  2.1 Entropy of a random variable
  2.2 Entropy of a text
  2.3 Integer coding
    2.3.1 Unary coding
    2.3.2 Rice coding
    2.3.3 Elias γ coding
    2.3.4 Elias δ coding
    2.3.5 ζ codes
    2.3.6 π codes
  2.4 Entropy coding
    2.4.1 Prefix coding and Huffman coding
    2.4.2 Arithmetic coding
    2.4.3 Asymmetric Numeral Systems
  2.5 Higher order entropy and entropy coding

3 Hybrid Integer Encoding
  3.1 Encoding scheme
  3.2 Analysis
  3.3 Experimental results

4 Novel techniques for high-order entropy coding
  4.1 Context clustering
    4.1.1 Heuristic algorithm
  4.2 Decision-tree-based context modeling
    4.2.1 Optimal algorithm for n=1, k > nt
    4.2.2 Optimal algorithm for nt=1
    4.2.3 Heuristic algorithm in pseudolinear time for n=1
    4.2.4 Heuristic for the general case

II Graph compression

5 Common techniques for graph compression
  5.1 Raw link encoding
  5.2 Grammar- and dictionary-based
  5.3 Class-tailored
    5.3.1 Compression of trees
  5.4 Tree-based
  5.5 Copying models
    5.5.1 Brief summary of WebGraph
  5.6 Decomposition
  5.7 Reordering

6 Graph compression in theory
  6.1 Erdös-Rényi
  6.2 Stochastic Block Model
  6.3 Uniform attachment
  6.4 Copy model
  6.5 Preferential attachment (Barabási-Albert)
  6.6 Simplified Copy Model

7 Graph compression in practice
  7.1 Encoding Integers
  7.2 Graph compression in Zuckerli
    7.2.1 Context management
    7.2.2 Choice of reference list and chain
    7.2.3 Full decompression
    7.2.4 List decompression
    7.2.5 Approximation guarantee
    7.2.6 Details on computing the optimal sub-forest of F
  7.3 Experimental results
    7.3.1 Datasets
    7.3.2 Parameter Choice
    7.3.3 Effect of Approximation Algorithm and Context Modeling
    7.3.4 Compression Results and Resource Usage
    7.3.5 Performance Evaluation
  7.4 Further improvements on the Zuckerli scheme
    7.4.1 Tree-based context modeling
    7.4.2 Reference selection algorithm
    7.4.3 Experimental results

8 Conclusions
  8.1 Future work

CHAPTER 1

INTRODUCTION

DATA COMPRESSION is a process through which the number of bits used to represent information can be reduced. In today's world, it is an incredibly important topic both for practical and theoretical pursuits.

Compression can be either lossless, when it allows the input to be reconstructed exactly, or lossy, when it only allows an approximation of the input to be reconstructed.

From the practical point of view, most of the services that we use today on the Internet would be significantly slower, if not entirely unfeasible, without compression. Lossless compression, which is used by more than half of the websites in the world [47], allows a 2–3x reduction in the number of bytes needed to transmit the text content, layout and behaviour of webpages. Lossy compression of images typically achieves 10x savings over uncompressed data, and lossy compression of video can be up to 1000x more efficient than uncompressed data, even with low information loss. It is thanks to compression, together with the improved processing power and connectivity that we enjoy in more modern times, that the Internet as we know it can exist.

From a theoretical point of view, compression is tightly tied to information theory, which tries to quantify the complexity of a given system. It is also for this reason that the capability of achieving good compression ratios on a given source has been connected with our understanding of the source, and a system's ability to compress has been used as a measure of its degree of intelligence [73]. This connection between compression and understanding has also been exploited in perhaps unexpected ways, such as being part of a measure of the level of consciousness of patients in a coma or under the effects of medication [31].

On the other hand, graphs are one of the most versatile structures in mathematics and computer science, finding extremely varied applications, from answering questions on afternoon walks in Prussian towns [45] to building ranking algorithms for search engines [81] and doing community detection [37,38].

Graph theory is also an important topic in theoretical research, being a source of multiple deep results like a quasipolynomial time algorithm for graph isomorphism [9] and the famous graph-minor theorem [90].

Given the importance of both those topics, the interest in compression of graphs is self-explanatory. In this thesis, we will focus on the compression of large graphs:

• From a theoretical point of view, analyses will be conducted keeping in mind the asymptotic behaviour of the systems under analysis, and will be focused on aspects of the theory that are relevant to large real-world networks.

• From a practical point of view, the focus will be mostly on graph classes with hundreds of millions or billions of edges, such as web graphs and social networks [69].

After this chapter, this thesis is divided into two parts. The rest of this chapter contains an overview of the structure of those two parts, as well as the notation and definitions that are used throughout the thesis.

1.1 First part: general-purpose compression

The first part of this thesis presents general techniques for data compression. These techniques will then be used in the second part of the thesis, and more specifically in Chapter 7, to obtain a new graph compression method that significantly improves the state of the art.

Chapter 2 provides an overview of some known techniques for general-purpose compression.

Chapter 3 introduces a novel integer coding scheme that uses both entropy coding and raw bits, also giving an analysis of its efficiency on common distributions and a comparison with other known techniques.

Chapter 4 introduces novel techniques to achieve practical higher-order entropy coding.

The techniques of Chapters 3 and 4 have been partially developed by the author during the development of JPEG XL [2] and partially for this thesis.

1.2 Second part: graph compression

This part of the thesis is dedicated specifically to the compression of graphs. It provides both theoretical and practical novel results on the topic of graph compression.

Chapter 5 provides an overview of the state of the art of encoding schemes for large graphs.

Chapter 6 contains the main theoretical results on graph compression in this thesis, providing a novel theoretical analysis of various graph models. In particular, it gives lower bounds on the compressed size of graphs belonging to a specific model and polynomial-time compression algorithms that match, or almost match, the lower bounds. Special focus is given to models for sparse graphs. While the results of this chapter are interesting from a theoretical point of view, they are not necessary for the practical results in Chapter 7.

Chapter 7 applies the techniques in Chapters 3 and 4 to graph compression, developing a new compression scheme that achieves significant density improvements compared to the state of the art. Part of the results in this chapter have been published in [98]; Section 7.4 provides a compression scheme that achieves some further improvements compared to [98], at the cost of an increase in encoding and decoding time.

1.3 Notation and definitions

We denote the base-2 logarithm of a positive number x as log x, and its natural logarithm as ln x, unless otherwise noted. We will assume $0 \log 0 = 0$ and $0^0 = 1$ for convenience of notation.

We will refer multiple times to two families of probability distributions on natural numbers: the geometric and the Zipf distribution. We remark that the Zipf distribution is usually defined on a range of positive integers, i.e. $\{1, \ldots, n\}$, while the distribution defined on $\mathbb{N}^+$ is usually denoted as the Zeta distribution. We will use the two terms interchangeably.

Differently from usual, we define the geometric distribution as the distribution of the number of failures, before the first success, of a sequence of independent trials each with a probability p of failing. It follows that the probability of an integer k is given by $p^k(1-p)$.

A Zipf distribution with parameter α is a distribution where the probability of an integer k is proportional to $k^{-\alpha}$. To ensure that the probabilities sum to 1, we normalize them using the Riemann ζ function, defined for all $\alpha > 1$:

$$\zeta(\alpha) = \sum_{n=1}^{\infty} \frac{1}{n^{\alpha}}$$

The probability of a given integer k under a Zipf distribution with parameter α is then $k^{-\alpha}\zeta(\alpha)^{-1}$.

A graph G is a pair $(V, E)$ of vertices (or nodes) and edges, with $E \subseteq V \times V$. We will typically assume V to be of the form $\{1, \ldots, n\}$. The degree $\delta_v$ of a node v is the number of edges that v belongs to. A sequence of distinct nodes $n_1, \ldots, n_k$ is a (simple) path if $(n_i, n_{i+1}) \in E$ for all i.

A graph can be directed or undirected: in an undirected graph, $(u, v) \in E \Leftrightarrow (v, u) \in E$; undirected edges are also denoted as $\{u, v\}$. If a graph is directed, we distinguish between the in-degree $\delta_v^-$ of a node (the number of edges of which the node is the second element) and its out-degree $\delta_v^+$ (the number of edges of which the node is the first element).

The neighbours of a node v in an undirected graph, denoted by $N(v)$, are all the nodes that share an edge with v. It follows that $|N(v)| = \delta_v$; in-neighbours ($N^-(v)$) and out-neighbours ($N^+(v)$) are defined in a similar way for directed graphs. An adjacency list for a node is the list of its (out-)neighbours.

A graph is connected if there is a path between any two nodes. A connected component of a graph is a maximal subset of nodes that is connected. An undirected graph is called a tree if there is a unique path between any two nodes. Any tree with n nodes has exactly n−1 edges. Nodes of degree 1 in a tree are called leaves.

A tree may be rooted in a node, called the root. In a rooted tree, we implicitly assume edges to be oriented away from the root, and in that case we define, for any edge $(u, v)$, u to be the parent of v and v to be a child of u.

A graph is a forest if its connected components are trees; equivalently, if there is at most one path between each pair of nodes. It follows that the number of edges of a forest with n vertices is at most n−1.

When the graph is clear from context and unless specified otherwise, we shall denote with n the number of nodes of a graph, and with m its number of edges.

When referring to graph representations, we will consider the following scenarios for access patterns.

• Full decompression: Decompress the representation entirely, obtaining the standard representation of G.

• List decompression: For any given node $u \in [n]$, decompress incrementally the adjacency list of u, while keeping the rest compressed.

• Edge queries: For any given pair of nodes $(u, v)$, determine whether $(u, v) \in E$ or $(u, v) \notin E$.

The above scenarios are listed from least to most flexible: indeed, it is always possible to support list decompression using n edge queries per list (although faster implementations might be possible), and it is possible to support full decompression using n list decompressions.

1.4 Published material

During the course of the PhD, more work was done that is not as tightly related to the topic of this thesis. For completeness, a full list of publications is reported here:

• Efficient algorithms for listing k disjoint st-paths in graphs. R. Grossi, A. Marino, L. Versari - Latin American Symposium on Theoretical Informatics, 2018

• Tight Lower Bounds for the Number of Inclusion-Minimal st-Cuts. A. Conte, R. Grossi, A. Marino, R. Rizzi, T. Uno, L. Versari - International Workshop on Graph-Theoretic Concepts in Computer Science, 2018

• Finding maximal common subgraphs via time-space efficient reverse search. A. Conte, R. Grossi, A. Marino, L. Versari - International Computing and Combina-torics Conference, 2018

• D2K: scalable community detection in massive networks via small-diameter k-plexes. A. Conte, T. De Matteis, D. De Sensi, R. Grossi, A. Marino, L. Versari - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018

• Listing subgraphs by Cartesian decomposition. A. Conte, R. Grossi, A. Marino, R. Rizzi, L. Versari - International Symposium on Mathematical Foundations of Computer Science, 2018

• Discovering k-Trusses in Large-Scale Networks. A. Conte, D. De Sensi, R. Grossi, A. Marino, L. Versari. IEEE High Performance extreme Computing Conference (HPEC), 2018

• Listing maximal subgraphs satisfying strongly accessible properties. A. Conte, R. Grossi, A. Marino, L. Versari - SIAM Journal on Discrete Mathematics 33, 2019

• JPEG XL next-generation image compression architecture and coding tools. J. Alakuijala, R. van Asseldonk, S. Boukortt, M. Bruse, IM. Comsa, M. Firsching, T. Fischbacher, E. Kliuchnikov, S. Gomez, R. Obryk, K. Potempa, A. Rhatushnyak, J. Sneyers, Z. Szabadka, L. Vandervenne, L. Versari, J. Wassenberg - Applications of Digital Image Processing XLII, 2019

• A fast discovery algorithm for large common connected induced subgraphs. A. Conte, R. Grossi, A. Marino, L. Tattini, L. Versari - Discrete Applied Mathematics 268, 2019

• Sublinear-Space and Bounded-Delay Algorithms for Maximal Clique Enumeration in Graphs. A. Conte, R. Grossi, A. Marino, L. Versari - Algorithmica, 2020

• Benchmarking JPEG XL image compression. J. Alakuijala, S. Boukortt, T. Ebrahimi, E. Kliuchnikov, J. Sneyers, E. Upenik, L. Vandevenne, L. Versari, J. Wassenberg - Optics, Photonics and Digital Technologies for Imaging Applications VI, 2020

• Temporal coding in spiking neural networks with alpha synaptic function. IM. Comsa, T. Fischbacher, K. Potempa, A. Gesmundo, L. Versari, J. Alakuijala - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020

• Truly Scalable K-Truss and Max-Truss Algorithms for Community Detection in Graphs. A. Conte, D. De Sensi, R. Grossi, A. Marino, L. Versari - IEEE Access, 2020

• Zuckerli: A New Compressed Representation for Graphs. L. Versari, IM. Comsa, A. Conte, R. Grossi - IEEE Access, 2020

• Intelligent Matrix Exponentiation. T. Fischbacher, IM. Comsa, K. Potempa, M. Firsching, L. Versari, J. Alakuijala - to appear.

• ISO/IEC DIS 18181-1: Information technology - JPEG XL Image Coding System - Part 1: Core coding system

PART I

GENERAL-PURPOSE COMPRESSION

CHAPTER 2

COMPRESSION CONCEPTS AND TECHNIQUES

ENTROPY is a fundamental concept in data compression, giving a lower bound on the number of bits necessary to represent a given piece of data. In this chapter, we define the concept of entropy of a random variable, and its application to data compression in the form of Shannon's Source Coding Theorem [94].

We then discuss some well-known techniques for encoding a sequence of independent, identically distributed random variables (typically positive integers or symbols from a given alphabet), which is also known as order-0 compression.

Finally, we discuss the concept of higher-order compression, which achieves better results when compressing a sequence of non-independent random variables, as is often the case for, e.g., text or graphs.

2.1 Entropy of a random variable

Consider the process of flipping a biased coin that has a probability p of obtaining heads. We define the information content, or surprise, of obtaining heads on such a coin as

$$I(\text{heads}) = -\log p$$

Note that the information content of “heads” decreases as p increases. This matches intuition, since it is less surprising to obtain heads on a coin with a probability of heads very close to 1.

We then define the information content of obtaining tails as $I(\text{tails}) = -\log(1-p)$.

We can now define the entropy of our coin flip as the expected amount of information content:

$$H(\text{coin flip}) = \sum_{e \in \{\text{heads}, \text{tails}\}} \Pr(e)\, I(e) = -p \log p - (1-p)\log(1-p)$$

Under this definition, the entropy of a coin with $p = 0$ or $p = 1$ is 0. This matches intuition as an event that is certain gives no information (or surprise).

We can generalize these definitions to any random event x and any random variable X.

Definition 1. The information content, or surprise, of an event x is defined as

$$I(x) = -\log \Pr(x)$$

The entropy of a random variable X is the expected value of the information content of X. For a discrete variable, we have

$$H(X) = \mathbb{E}[I(X)] = \sum_{x \in X} \Pr(x)\, I(x) = -\sum_{x \in X} \Pr(x) \log \Pr(x)$$

Following the definition, we can compute the entropy for some common random variables.

• The entropy of a random variable that is uniformly distributed over n values is $H(U_n) = \log n$.

• The entropy of a Bernoulli random variable with probability p is, as computed above, $-p \log p - (1-p)\log(1-p)$. We will denote this value as $H_p$.

• The entropy of a subset of n elements in a universe of m elements, chosen uniformly at random, is $\log \binom{m}{n} = m H_{n/m} + O(\log m)$.
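To make Definition 1 and the examples above concrete, here is a small self-contained C++ sketch (ours, not from the thesis; names are illustrative) that computes the entropy of a discrete distribution given as a vector of counts, using the convention $0 \log 0 = 0$.

#include <cmath>
#include <cstdio>
#include <vector>

// Entropy (in bits) of the empirical distribution defined by a vector of
// symbol counts: H = -sum_s (n_s / n) * log2(n_s / n).
double Entropy(const std::vector<double>& counts) {
  double total = 0;
  for (double c : counts) total += c;
  double h = 0;
  for (double c : counts) {
    if (c > 0) {  // by convention, 0 log 0 = 0
      double p = c / total;
      h -= p * std::log2(p);
    }
  }
  return h;
}

int main() {
  std::printf("%f\n", Entropy({1, 1}));                    // fair coin: 1.0
  std::printf("%f\n", Entropy({9, 1}));                    // biased coin: ~0.469
  std::printf("%f\n", Entropy({1, 1, 1, 1, 1, 1, 1, 1}));  // uniform over 8: 3.0
  return 0;
}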

The fundamental application of entropy to data compression is given by Shannon’s Source Coding Theorem [94]:


Theorem 2 (Source Coding Theorem). With high probability, at least $nH(X)$ bits are required to represent a sequence of n i.i.d. random variables with probability distribution X.

One important property of entropy is that it is additive: the entropy of a sequence of independent random variables is equal to the sum of the entropies of each random variable.

In the rest of this chapter, we will refer to the redundancy of an encoding scheme:

Definition 3 (Redundancy). The redundancy of a given encoding scheme for a specific source is given by the ratio between the number of bits used by the encoding scheme and the lower bound given by the Source Coding Theorem, i.e. the entropy of the source.

It follows that the redundancy is always at least 1. We remark that this definition is somewhat different from the usual one, where the redundancy is the difference between the number of used bits and the entropy.

2.2 Entropy of a text

Here and in the rest of this thesis, when we mention a text we refer to a sequence $T = t_0 t_1 \ldots t_n$ of symbols from a given alphabet $\Sigma$.

To define the entropy for a text, we need to define a probability distribution for it. The simplest definition is that of order 0 entropy, in which we assume every symbol to be chosen independently of the others.

Typically, we assign to each symbol a probability equal to its frequency. Thus, if we denote with $n_s$ the number of occurrences of s in T, we have

$$H_0(T) = \sum_{s \in \Sigma} n_s \log \frac{n}{n_s}$$

However, this definition is not entirely satisfactory, as in most natural languages symbol choices are not independent: for example, in an English text, the sequence don’ is almost always followed by t.

The concept of higher order entropy models this property. In particular, to define the k-th order entropy $H_k$ of a text, we consider every symbol to belong to a different probability distribution depending on the previous k symbols.

Let us consider, as an example, the text $T_n = a^n b^n$. As a and b appear the same number of times in $T_n$, each of them has a frequency of 0.5, and thus the order-0 entropy is

$$H_0(T_n) = 0.5n + 0.5n = n$$

For computing the order-1 entropy, we divide the symbols in $T_n$ into 3 groups:

• The first symbol in the text: by definition, there is only one such symbol, which occurs with frequency 1 and thus provides no entropy.

• Symbols preceded by an a: there are n−1 symbols a and one symbol b, for a total entropy of $nH_{1/n}$.

• Symbols preceded by a b: there are exactly n−1 symbols b, which occur with frequency 1 and thus provide no entropy.

The total order-1 entropy is thus

$$H_1(T_n) = nH_{1/n} = \log n + (n-1)\log\frac{n}{n-1} = \log n + O(1)$$

This value is significantly smaller than the order-0 entropy: this is because $T_n$ has a very specific structure and symbol choices are very far from being independent.

2.3 Integer coding

It is common to encode integers that are independently produced with a given distribution. Multiple encoding schemes have been produced, suited to various distributions; as a consequence of the Source Coding Theorem, a given encoding scheme that uses $x(i)$ bits for an integer i is optimal for a distribution where i has a probability of $2^{-x(i)}$, provided that $\sum_{i=1}^{\infty} 2^{-x(i)} = 1$.

All the codes in this section are described in the setting of encoding positive integers. If it is necessary to also encode 0, the number to be encoded can be increased by 1.

2.3.1 Unary coding

This is one of the simplest encodings: to encode x, a sequence of x−1 binary digits 1 is produced, followed by a single digit 0. It is also common to see the roles of 0 and 1 swapped, which does not change the characteristics of the encoding.

Encoding x thus uses x bits: this encoding is optimal for sources for which x has probability $2^{-x}$, i.e. a geometric distribution of parameter 1/2.

2.3.2 Rice coding

Rice coding can be seen as a generalization of unary coding. It is defined by a parameter $M = 2^k$. When $k = 0$, this procedure defines the same encoding as unary coding.

In general, a number N is encoded by:

• encoding $\lfloor N/2^k \rfloor$ in unary, and
• encoding $N \bmod 2^k$ using exactly k bits.

It follows that encoding N requires $\lfloor N/2^k \rfloor + k$ bits. In general, this encoding can only be optimal for sequences in which each group of $2^k$ consecutive symbols is equiprobable.

However, if we ignore the errors introduced by rounding, we can see that Rice coding for a given k is optimal if $\lfloor N/2^k \rfloor$ is distributed geometrically with $p = \frac{1}{2}$. This property suggests that Rice coding is a good choice for geometrically distributed sources where each trial has a probability of failure of $\left(\frac{1}{2}\right)^{2^{-k}}$.

Finally, we remark that it is possible to generalize Rice coding to the case where M is not a power of two. The resulting coding scheme is called Golomb coding.
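As an illustration of unary and Rice coding, here is a minimal C++ sketch of our own (it writes bits into a std::string for readability, and uses the convention that the quotient is written as that many 1 digits followed by a terminating 0): the quotient $\lfloor N/2^k \rfloor$ is written in unary and the remainder on k raw bits.

#include <cstdint>
#include <iostream>
#include <string>

// Rice code of a non-negative integer N with parameter M = 2^k: the quotient
// q = floor(N / 2^k) in unary (q ones and a terminating zero), then the k low
// bits of N. With k = 0 this degenerates to plain unary coding.
std::string RiceEncode(uint64_t n, int k) {
  std::string out(n >> k, '1');
  out.push_back('0');
  for (int b = k - 1; b >= 0; --b) out.push_back(((n >> b) & 1) ? '1' : '0');
  return out;
}

// Decode a single Rice codeword starting at *pos; advances *pos past it.
uint64_t RiceDecode(const std::string& bits, int k, size_t* pos) {
  uint64_t q = 0;
  while (bits[*pos] == '1') { ++q; ++*pos; }
  ++*pos;  // skip the terminating '0'
  uint64_t r = 0;
  for (int b = 0; b < k; ++b) r = (r << 1) | uint64_t(bits[(*pos)++] - '0');
  return (q << k) | r;
}

int main() {
  for (uint64_t n : {0, 1, 5, 23}) {
    std::string code = RiceEncode(n, 2);
    size_t pos = 0;
    std::cout << n << " -> " << code << " -> " << RiceDecode(code, 2, &pos) << "\n";
  }
}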

2.3.3 Elias γ coding

All the codes described so far are most suitable for geometric distributions. However, geometric distributions decrease in probability very quickly compared to many real-world data streams. Elias γ coding [42] is a code that addresses this issue, being suitable for numbers that follow a power-law distribution.

The Elias γ code for a positive integer N is defined as follows:

• First, $\lfloor \log N \rfloor + 1$ is encoded in unary.
• Then, the lowest $\lfloor \log N \rfloor$ bits of N are encoded directly.

It follows that this encoding uses $2\lfloor \log N \rfloor + 1$ bits to represent N; it is thus suitable for integers distributed in such a way that the probability of N is close to

$$2^{-2\lfloor \log N \rfloor - 1} \approx \frac{1}{2N^2}$$

Thus, γ coding is suitable for integers following a Zipf distribution with exponent close to 2.
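A similarly minimal sketch of the Elias γ code just described (ours; it follows the unary convention of Section 2.3.1, i.e. $\lfloor \log N \rfloor$ ones followed by a zero, then the low $\lfloor \log N \rfloor$ bits of N):

#include <cstdint>
#include <iostream>
#include <string>

// Elias gamma code of a positive integer N: floor(log2 N) + 1 in unary
// (floor(log2 N) ones and a zero), followed by the low floor(log2 N) bits of N.
std::string EliasGammaEncode(uint64_t n) {
  int logn = 0;
  while ((n >> (logn + 1)) != 0) ++logn;  // floor(log2 n), for n >= 1
  std::string out(logn, '1');
  out.push_back('0');
  for (int b = logn - 1; b >= 0; --b) out.push_back(((n >> b) & 1) ? '1' : '0');
  return out;  // total length: 2*floor(log2 n) + 1 bits
}

int main() {
  // 1 -> 0, 2 -> 100, 5 -> 11001, 13 -> 1110101
  for (uint64_t n : {1, 2, 5, 13}) {
    std::cout << n << " -> " << EliasGammaEncode(n) << "\n";
  }
}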

2.3.4 Elias δ coding

The Elias δ code [42] for an integer N is defined as follows:

• First, $\lfloor \log N \rfloor + 1$ is encoded using Elias γ coding.
• Then, the lowest $\lfloor \log N \rfloor$ bits of N are encoded directly.

It follows that this encoding uses $2\lfloor \log \lfloor \log N \rfloor \rfloor + \lfloor \log N \rfloor + 1$ bits to represent N; it is thus suitable for integers distributed in such a way that the probability of N is close to

$$2^{-2\lfloor \log \lfloor \log N \rfloor \rfloor - \lfloor \log N \rfloor - 1} \approx \frac{1}{2N\log^2 N}$$

This probability distribution decreases more slowly than any Zipf distribution (as the sum of $N^{-1}$ over the natural numbers diverges, and hence there is no Zipf distribution of exponent 1). However, Zipf distributions of exponent very close to 1 should be well represented by Elias δ coding.

2.3.5 ζ codes

Elias codes provide a good representation for Zipf distributions with exponents close to 1 and 2. To address the necessity of encoding Zipf distributions with intermediate exponents, such as the exponent of ≈1.3 that is common in web graphs, [23] introduces ζ codes.

To define this code, we first need to define the minimal binary code of an integer $n \in [0, z)$. Let $s = \lceil \log z \rceil$. The minimal binary code of n is defined as follows:

• If $n < 2^s - z$, then n is represented as its binary representation using $s - 1$ bits.
• If $n \geq 2^s - z$, then n is represented as the binary representation of $n + 2^s - z$ using s bits.

We remark that, if $z = 2^d$, this representation corresponds to the usual binary representation on d digits.

ζ codes are parameterized by a shrinking factor k. To represent an integer N, let h be such that $N \in [2^{hk}, 2^{(h+1)k})$. Then, the representation is given by:

• $h + 1$ written in unary, and
• the minimal binary code of $N - 2^{hk}$, with $z = 2^{(h+1)k} - 2^{hk}$.

When $k = 1$, then $h = \lfloor \log N \rfloor$ and $2^{(h+1)k} - 2^{hk} = 2^h$, making the minimal binary code correspond to the binary code on h digits. Hence, $\zeta_1$ is the same as Elias γ coding.

In the general case, the $\zeta_k$ code uses $\lfloor \log N / k + 1 \rfloor (k+1)$ or $\lfloor \log N / k + 1 \rfloor (k+1) + 1$ bits to encode an integer N, depending on which of the two different representations is used for the minimal binary code. This corresponds to an implied probability of approximately

$$\Theta\left(2^{-(\log N / k + 1)(k+1)}\right) = \Theta\left(N^{-(k+1)/k}\right) = \Theta\left(\frac{1}{N^{1 + \frac{1}{k}}}\right)$$

which suggests that $\zeta_k$ codes are good for integers distributed with a Zipf law with exponent $1 + \frac{1}{k}$. Plugging in $k = 1$ gives again the observation that Elias γ codes are good for Zipf distributions with exponent 2, as expected.
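The following sketch (our own illustration; the exact bit-level conventions of [23] may differ in minor details) puts together the minimal binary code and the $\zeta_k$ construction described above:

#include <cstdint>
#include <iostream>
#include <string>

// Append value v using exactly 'bits' binary digits (MSB first).
static void AppendBits(std::string* out, uint64_t v, int bits) {
  for (int b = bits - 1; b >= 0; --b) out->push_back(((v >> b) & 1) ? '1' : '0');
}

// Minimal binary code of n in [0, z): the first 2^s - z values use s - 1 bits,
// the remaining ones are written as n + 2^s - z on s bits, with s = ceil(log2 z).
static void MinimalBinary(std::string* out, uint64_t n, uint64_t z) {
  int s = 0;
  while ((uint64_t{1} << s) < z) ++s;           // s = ceil(log2 z)
  uint64_t threshold = (uint64_t{1} << s) - z;  // number of shorter codewords
  if (n < threshold) AppendBits(out, n, s - 1);
  else AppendBits(out, n + threshold, s);
}

// Zeta_k code of a positive integer N: with h such that 2^{hk} <= N < 2^{(h+1)k},
// write h + 1 in unary (h ones and a zero), then the minimal binary code of
// N - 2^{hk} over z = 2^{(h+1)k} - 2^{hk} values.
std::string ZetaEncode(uint64_t n, int k) {
  int h = 0;
  while (n >= (uint64_t{1} << uint64_t((h + 1) * k))) ++h;
  std::string out(h, '1');
  out.push_back('0');
  uint64_t low = uint64_t{1} << uint64_t(h * k);
  uint64_t high = uint64_t{1} << uint64_t((h + 1) * k);
  MinimalBinary(&out, n - low, high - low);
  return out;
}

int main() {
  for (uint64_t n : {1, 2, 3, 7, 100}) {
    std::cout << n << " -> " << ZetaEncode(n, 2) << "\n";
  }
}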

2.3.6 π codes

π codes, introduced in [5], allow efficient encoding of Zipf distributions that have an exponent even closer to 1 compared to ζ codes. Like ζ codes, they are also parameterized by a positive integer k.

The encoding is defined as follows. Let $h = 1 + \lfloor \log N \rfloor$, and let $l = \lceil \frac{h}{2^k} \rceil$. The code is given by:

• $l$ written in unary,
• $2^k l - h$ written using exactly k bits, as it is a number in $[0, 2^k)$, and
• the least significant $\lfloor \log N \rfloor$ bits of N.

The total number of bits used to represent N is thus $k + \left\lceil \frac{1 + \lfloor \log N \rfloor}{2^k} \right\rceil + \lfloor \log N \rfloor$. This code is hence optimal for distributions where the probability of N is approximately

$$\Theta\left(2^{-\left(\log N / 2^k + \log N\right)}\right) = \Theta\left(\frac{1}{N^{1 + 2^{-k}}}\right)$$

which is the case for Zipf distributions of exponent $\alpha \approx 1 + 2^{-k}$. Thus, π codes can efficiently encode Zipf distributions with exponents significantly closer to 1 for a similar value of k when compared to ζ codes.

2.4 Entropy coding

Entropy coding is a set of techniques that allows us to achieve compression close to the entropy bound for any probability distribution, as long as the distribution is known in advance both on the encoding and on the decoding side of the communication. This typically implies that the compressed representation should start with some representation of the probability distributions.

2.4.1 Prefix coding and Huffman coding

Prefix coding represents symbols from a given alphabet $\Sigma$ as a sequence of bits, with the property that for any two distinct symbols $s, t \in \Sigma$ the sequence of bits that corresponds to s is not a prefix of the sequence of bits that corresponds to t. This property is usually referred to as the code being prefix-free.

The advantage of a prefix-free code is the simplicity of the decoding procedure: indeed, it is sufficient to read one bit at a time from the encoded stream, extending a sequence of bit values. As soon as the sequence of bit values corresponds to the sequence associated with a symbol, that symbol can be produced, and the sequence cleared: this is because there is no ambiguity between the sequence for a symbol or the prefix of the sequence for another symbol.

However, in practical implementations, usually the decoding procedure uses a table of size $2^k$, where k is the longest length of a sequence, that contains the first symbol to be decoded for each of the possible combinations; decoding then proceeds by reading k bits at a time, looking up the symbol to be decoded, and then advancing the position in the encoded stream by the correct amount (possibly less than k).
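A toy version of this table-based decoder (our own sketch; real implementations pack bits instead of using strings) can be written as follows:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Entry { int symbol; int length; };  // decoding table entry

// Build a table of size 2^k (k = longest codeword length): entry w gives the
// symbol whose codeword is a prefix of the k-bit window w, plus its length.
std::vector<Entry> BuildTable(const std::vector<std::string>& codes, int k) {
  std::vector<Entry> table(size_t{1} << k);
  for (int s = 0; s < (int)codes.size(); ++s) {
    uint32_t prefix = 0;
    for (char b : codes[s]) prefix = (prefix << 1) | uint32_t(b - '0');
    int free_bits = k - (int)codes[s].size();
    for (uint32_t suffix = 0; suffix < (uint32_t{1} << free_bits); ++suffix) {
      table[(prefix << free_bits) | suffix] = {s, (int)codes[s].size()};
    }
  }
  return table;
}

int main() {
  // Example prefix-free code: a = 0, b = 10, c = 11 (longest length k = 2).
  std::vector<std::string> codes = {"0", "10", "11"};
  int k = 2;
  std::vector<Entry> table = BuildTable(codes, k);

  std::string stream = "0101100";  // encodes "abcaa"
  size_t pos = 0;
  for (int decoded = 0; decoded < 5; ++decoded) {
    uint32_t window = 0;  // next k bits, zero-padded past the end of the stream
    for (int b = 0; b < k; ++b) {
      char bit = (pos + b < stream.size()) ? stream[pos + b] : '0';
      window = (window << 1) | uint32_t(bit - '0');
    }
    Entry e = table[window];
    std::cout << "abc"[e.symbol];
    pos += e.length;  // advance by the length of the codeword actually decoded
  }
  std::cout << "\n";   // prints "abcaa"
}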

Prefix coding is, by definition, stateless: it is possible to resume decoding from any symbol boundary in the bitstream, without the need of any extra information. Hence, prefix codes are well suited for streams that require random access to specific positions in the encoded data, as the bit position of the starting bit of the required symbol is sufficient to resume decoding.

There are multiple ways to construct a prefix code for a given distribution of symbols. Among those, the most well-known is certainly Huffman coding [58], which produces optimal prefix codes; note that this does not mean that Huffman coding produces an optimal encoding, but just that no prefix code can have lower cost. It can be described as follows:

• At the beginning, $|\Sigma|$ single-node binary trees are created, one per symbol; we define the size of each of those trees as the number of occurrences of the corresponding symbol.

• At each step, the two trees of smallest size are merged into a new tree that contains a root with the two trees as left and right children; the size of the new tree is the sum of the sizes of the old trees.

• The process is repeated until there’s only one tree.

• The code associated with each symbol is given by the path from the root to the corresponding leaf, where each left branch is represented by a 0 bit and each right branch by a 1 (or vice-versa).

Other algorithms to build prefix codes are known. Among those, of particular note is [67], which provides an algorithm to build optimal length-limited prefix codes, i.e. prefix codes for which the code word for any symbol does not exceed a specified length.
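For concreteness, here is a compact sketch (ours, not the thesis implementation) of the Huffman construction just described; it returns only the code length of each symbol, from which an actual prefix code can then be assigned.

#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

// Huffman code length of each symbol, given its count (assumes >= 2 symbols).
// Builds the tree by repeatedly merging the two trees of smallest size.
std::vector<int> HuffmanCodeLengths(const std::vector<uint64_t>& counts) {
  struct Node { uint64_t size; int left, right; };
  std::vector<Node> nodes;
  using Item = std::pair<uint64_t, int>;  // (tree size, node index), min-heap
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  for (size_t s = 0; s < counts.size(); ++s) {
    nodes.push_back({counts[s], -1, -1});  // one single-node tree per symbol
    heap.push({counts[s], (int)s});
  }
  while (heap.size() > 1) {
    auto [sa, a] = heap.top(); heap.pop();
    auto [sb, b] = heap.top(); heap.pop();
    nodes.push_back({sa + sb, a, b});      // merge the two smallest trees
    heap.push({sa + sb, (int)nodes.size() - 1});
  }
  // The code length of a symbol is its depth in the final tree.
  std::vector<int> lengths(counts.size(), 0);
  std::vector<std::pair<int, int>> stack = {{(int)nodes.size() - 1, 0}};
  while (!stack.empty()) {
    auto [idx, depth] = stack.back(); stack.pop_back();
    if (nodes[idx].left < 0) { lengths[idx] = depth; continue; }  // leaf
    stack.push_back({nodes[idx].left, depth + 1});
    stack.push_back({nodes[idx].right, depth + 1});
  }
  return lengths;
}

int main() {
  // Example: three symbols with counts 8, 4, 2 -> code lengths 1, 2, 2.
  for (int l : HuffmanCodeLengths({8, 4, 2})) std::cout << l << " ";
  std::cout << "\n";
}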

2.4.2 Arithmetic coding

While prefix coding is more flexible than fixed integer codes, it uses an integer number of bits per symbol; it follows from the definition of entropy that it cannot be optimal unless $-\log p \in \mathbb{N}$, i.e. the probability of each symbol is a power of 2.

One possible method to overcome this limitation is arithmetic coding [84]. The main idea of arithmetic coding is to represent the sequence of symbols as a single number in [0, 1).

To this end, we proceed as follows:

• We initialize the current interval as[0, 1).

• We split the current interval into sub-intervals, with one interval per symbol, and length proportional to the probability $p_s$ of each symbol s.

• We replace the current interval with the interval corresponding to the next symbol, and repeat from the previous step while there are still symbols to encode.

• At the end, we pick any fraction with a denominator of the form $2^k$ that is contained in the final interval; the encoding of the sequence is given by the numerator of this fraction, represented using k bits.

If we choose $p_s = \frac{n_s}{n}$, where $n_s$ is the number of occurrences of a symbol s and n is the total number of symbols, the size of the final interval is given by

$$\prod_{s \in \Sigma} \left(\frac{n_s}{n}\right)^{n_s}$$

Since any interval of size e contains a fraction with denominator $2^{-\log e + 1}$, it follows that the total number of bits used by arithmetic coding is at most (ignoring the cost of encoding the number of bits itself, which is at most an additional logarithmic term)

$$1 - \log \prod_{s \in \Sigma} \left(\frac{n_s}{n}\right)^{n_s} = 1 + \sum_{s \in \Sigma} n_s \log \frac{n}{n_s}$$

which matches the entropy bound of Section 2.2 up to one extra bit.

In practical implementations, it is of course not ideal to operate with arbitrary precision floating point numbers. Hence, probabilities are typically rounded to some arbitrarily small precision such as $2^{-12}$, and the arithmetic coder is flushed periodically: whenever the interval becomes sufficiently small, some bits are written to the output stream and the interval is rescaled by the corresponding power of two. This allows using arithmetic coding with integer arithmetic, instead of arbitrary precision floating point arithmetic. However, even a small interval does not always allow fully determining the next bits to be written; it is possible to overcome this issue, but as arithmetic coding is not the focus of this thesis, we refer the reader to [77] for more detail about its practical implementation. The case in which $|\Sigma| = 2$ is particularly simple and efficient to implement in practice, and is often referred to as binary arithmetic coding.

2.4.3 Asymmetric Numeral Systems

Like arithmetic coding, Asymmetric Numeral Systems, or ANS [41], encode a sequence $s_1, \ldots, s_n$ of input symbols in a single number x that can be represented with a number of bits that is close to the entropy of the data stream. However, compared to traditional methods of arithmetic coding it can achieve faster decompression speeds, at the cost of requiring the decoding process to obtain symbols in reverse order compared to the encoding process.

The encoding process adds a symbol s to the sequence represented by a number x by producing a new integer

$$C(s, x) = M \left\lfloor \frac{x}{F_s} \right\rfloor + B_s + (x \bmod F_s)$$

where M is the sum of the frequencies of all the symbols, $F_s$ is the frequency of the symbol s and $B_s$ is the cumulative frequency of all the symbols before s. The inverse of this function is

$$D(x) = \left(s,\ F_s \left\lfloor \frac{x}{M} \right\rfloor + (x \bmod M) - B_s\right)$$

where s is such that $B_s \leq x \bmod M < B_{s+1}$. The decoder can thus reverse the encoding process without any division operation (except by M, which is usually a power of 2); moreover, s can be computed by a lookup in a precomputed table of M elements.

Like all variants of arithmetic coding, practical implementations of ANS do not use arbitrary precision arithmetic, but rather they keep an internal state in a fixed range $[S, 2^b S)$ that is manipulated for each symbol in the stream: when the state overflows, it yields b bits during encoding; when the state underflows, it consumes b bits when decoding. For correct decoding, it is required that S is a multiple of M. A reasonable choice is to set $S = 2^{16}$, $M = 2^{12}$, and $b = 16$. More details about practical ANS implementations can be found in [78].

Note that, since the encoding procedure is just the reverse of the decoding procedure, ANS makes it easy to interleave non-compressed and compressed bits. This is especially useful when encoding integers using both entropy coding and directly coded bits, as done in Chapter 3.
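The following is a self-contained sketch of the core ANS transition, written by us directly from the C and D formulas above; it omits the renormalization to a fixed state range described previously and simply lets the state grow, which is fine for a short demo.

#include <cstdint>
#include <iostream>
#include <vector>

// Symbol frequencies F_s summing to M, and cumulative frequencies B_s.
struct AnsModel {
  std::vector<uint32_t> freq;  // F_s
  std::vector<uint32_t> cum;   // B_s = sum of F_t for t < s
  uint32_t total;              // M
};

AnsModel MakeModel(const std::vector<uint32_t>& freq) {
  AnsModel m{freq, {}, 0};
  for (uint32_t f : freq) { m.cum.push_back(m.total); m.total += f; }
  return m;
}

// C(s, x) = M * floor(x / F_s) + B_s + (x mod F_s)
uint64_t AnsEncodeSymbol(const AnsModel& m, int s, uint64_t x) {
  return (uint64_t)m.total * (x / m.freq[s]) + m.cum[s] + (x % m.freq[s]);
}

// D(x) = (s, F_s * floor(x / M) + (x mod M) - B_s), with B_s <= x mod M < B_{s+1}
int AnsDecodeSymbol(const AnsModel& m, uint64_t* x) {
  uint32_t r = *x % m.total;
  int s = 0;
  while (s + 1 < (int)m.cum.size() && m.cum[s + 1] <= r) ++s;  // find symbol slot
  *x = (uint64_t)m.freq[s] * (*x / m.total) + r - m.cum[s];
  return s;
}

int main() {
  AnsModel m = MakeModel({4, 2, 2});  // 3 symbols, M = 8
  std::vector<int> input = {0, 1, 0, 2, 0, 0, 1, 2};
  uint64_t state = 1;
  for (int s : input) state = AnsEncodeSymbol(m, s, state);
  std::cout << "state = " << state << "\n";
  // Decoding pops symbols in reverse order of encoding.
  std::vector<int> output;
  for (size_t i = 0; i < input.size(); ++i) output.push_back(AnsDecodeSymbol(m, &state));
  for (size_t i = 0; i < output.size(); ++i)
    std::cout << output[output.size() - 1 - i] << " ";  // reverse to recover input
  std::cout << "\n";
}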

2.5 Higher order entropy and entropy coding

As entropy coding requires communicating the probability distributions in advance, it quickly becomes impractical for higher order entropy coding. For example, to encode a typical binary file (which has 256 distinct symbols) with order-2 entropy coding, it would be necessary to communicate 65 536 distributions.

Two common workarounds for this issue are:

• To use other techniques that can exploit higher order correlations in the source data; for example, the Lempel-Ziv sliding window compression scheme [106] has been shown [103] to be asymptotically optimal even for data with higher order correlations. As another example, the usage of run-length encoding [91], that is, encoding a repetition count together with each encoded symbol, allows getting rid of the simplest correlations such as long sequences of repeated symbols. Finally, it was shown in [61] that applying the Burrows-Wheeler Transform also allows compressing its input within higher-order entropy bounds.

• To use adaptive probability models, that is, to transmit a small number of probability distributions that get modified in the same way during the encoding and decoding process. This is what is done, for example, in CABAC, in the H.264 video coding standard [65].

CHAPTER 3

HYBRID INTEGER ENCODING

HYBRID INTEGER ENCODING is a novel, general technique to encode integers with potentially very large values using a combination of entropy coding and raw bits. This scheme was initially developed in the context of the JPEG XL image compression format [2].

Directly applying entropy coding to encode integers from unbounded distributions (or, in practice, even bounded ones with very large values) is unfeasible, because it would require communicating an arbitrary distribution over $\mathbb{N}$.

To circumvent this issue, a common idea is to have a single entropy-coded symbol represent multiple integers, and then use additional bits to disambiguate. This idea dates back to at least the JPEG standard [99].

To the best of our knowledge, Hybrid Integer Encoding is the first encoding scheme that is parameterized to allow different trade-offs between integer range reduction and precision of the encoding. Moreover, it is defined analytically, making it suitable for any range of integers, while previous schemes were defined case-by-case, making them usable only on a limited range. A similar technique is introduced independently in [78]; this scheme proceeds by representing the integer in a fixed base (say, base 16) and representing the first digits using entropy coding; the rest of the digits are then represented directly. Hybrid Integer Encoding is more flexible in that it allows increasing precision for smaller numbers, as well as specifying a number of least-significant bits to be represented by entropy coding (for cases where, e.g., the integers to be encoded are all even). Moreover, the precision of the approximation is more uniform across the space of represented integers; more details and an experimental comparison can be found in Section 3.3.

Hybrid Integer Encoding is defined by three parameters i, j and k, with $k \geq i + j$ and $i, j \geq 0$. The integer encoding scheme initially described in [2] corresponds to the variant of Hybrid Integer Encoding with $k = 4$, $i = 1$, $j = 0$.

In the rest of this chapter we describe the proposed scheme and analyze the performance of this method on various distributions of integers, comparing to the specific integer codes that have been described in Section 2.3. For ease of implementation, we will restrict the probability distributions to the $[0, 2^{32})$ range.

3.1 Encoding scheme

The k parameter defines the number of direct symbols. Every integer in the range $[0, 2^k)$ is encoded directly as a symbol in the alphabet.

Any other integer $x \geq 2^k$ is encoded as follows. First, consider the binary representation of x: $b_p b_{p-1} \cdots b_1$, where $b_p = 1$ is the highest non-zero bit, and p its index (hence $p = \lfloor \log x \rfloor + 1$). Identify x with its corresponding triple $(m, t, l)$, where m is the integer formed by the i bits $b_{p-1} \cdots b_{p-i}$ following $b_p$, l is the integer formed by the rightmost j bits $b_j \cdots b_1$, and t is the integer encoded by the bits between those of m and l, as illustrated below:

$$1 \;\; \overbrace{b_{p-1} \ldots b_{p-i}}^{m} \;\; \overbrace{b_{p-i-1} \ldots b_{j+1}}^{t} \;\; \overbrace{b_j \ldots b_1}^{l}$$

Clearly, given the triple $(m, t, l)$, we can reconstruct x. We conveniently encode that triple by a pair $(s, t)$, where $s = 2^k + (p - k - 1) \cdot 2^{i+j} + m \cdot 2^j + l$ encodes the value of $p \geq k + 1$ by $(p - k - 1) \cdot 2^{i+j}$, the value of m as $m \cdot 2^j$, and the value of l. By adding $2^k$ to s we guarantee that $s \geq 2^k$, as values of $s < 2^k$ represent values of $x < 2^k$ directly.

For example, for $k = 4$, $i = 1$, and $j = 2$, the integer $x = 211$ has binary representation 1 1 0100 11 (with $p = 8$) and its corresponding triple is $(1, 4, 3)$; it is thus encoded as the pair $(16 + 3 \cdot 8 + 1 \cdot 4 + 3,\ 4) = (47, 4)$. As another example, when $k = 4$, $i = 1$ and $j = 1$, the integers from 0 to 15 are encoded with their corresponding symbol s in the alphabet, and t is empty; 23 has binary representation 10111 and thus is encoded as symbol 17 (the highest set bit is in position 5, the following bit is 0, and the last bit is 1), followed by the two remaining bits 11; 33 is encoded as symbol 21 (highest set bit is in position 6, following bit is 0 and last bit is 1) followed by the three remaining bits 000.

Algorithm 1: How to decode an (s, t) pair.

    if $s < 2^k$ then return s;
    $l \leftarrow (s - 2^k) \bmod 2^j$;
    $h \leftarrow (s - l - 2^k) / 2^j$;
    $m \leftarrow h \bmod 2^i$;
    $n \leftarrow (h - m) / 2^i$  (note $n = p - k - 1$);
    return $2^{n+k} + m \cdot 2^{n+k-i} + t \cdot 2^j + l$;

As for t, it is stored as-is in the encoded stream, just after entropy coding s. Note that it is possible to compute the number of bits of t from s, without knowing x: this allows the decoder to know how many bits to read. The procedure to decode an integer from the (s, t) pair consists of recovering the corresponding triple (m, t, l) and then reconstructing x, and is detailed in Algorithm 1.

Table 3.1 reports more values of s, t and the number of bits n of t for some combinations of k, i, j.

We remark that the number of entropy coded symbols is logarithmic in the range of the actual integers to encode; thus, it is necessary to store very few probability values even when encoding integers with a range of billions.
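As a sanity check (worked out here for illustration; not part of the original text), applying Algorithm 1 to the pair $(47, 4)$ from the first example above, with $k = 4$, $i = 1$, $j = 2$, recovers $x = 211$:

$$l = (47 - 16) \bmod 4 = 3, \qquad h = \frac{47 - 3 - 16}{4} = 7, \qquad m = 7 \bmod 2 = 1, \qquad n = \frac{7 - 1}{2} = 3,$$
$$x = 2^{3+4} + 1 \cdot 2^{3+4-1} + 4 \cdot 2^2 + 3 = 128 + 64 + 16 + 3 = 211.$$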

3.2 Analysis

We will now provide an analysis of the compression performance of the given encoding scheme, both theoretically and with experimental results, including comparisons with existing schemes.

Theorem 4. The redundancy of Hybrid Integer Encoding with parameters k, i, 0 on a stream of i.i.d. random variables, where value v has probability $p_v$, can be bounded from above by

$$r = \max_{a \geq k}\left(1,\ \max_{0 \leq b < 2^i}\ \max_{v \in I_{a,b}} \frac{a - i - \log \sum_{v' \in I_{a,b}} p_{v'}}{-\log p_v}\right)$$

where $I_{a,b} = \left[2^a + b \cdot 2^{a-i},\ 2^a + (b+1) \cdot 2^{a-i}\right)$.

Proof. By the definition of Hybrid Integer Encoding, for any integer in $I_{a,b}$ with $a \geq k$ and $0 \leq b < 2^i$, we obtain the same value of s, and t has a length of $a - i$ bits.

(32)

Table 3.1: Encoded symbol s, number of raw bits n, and raw bits t for sample values x, under several parameter choices (k, i, j).

  x     (4, 2, 0)        (4, 1, 1)        (2, 1, 0)         (2, 0, 2)
        s   n  t         s   n  t         s   n  t          s   n  t
  0     0   0  -         0   0  -         0   0  -          0   0  -
  1     1   0  -         1   0  -         1   0  -          1   0  -
  2     2   0  -         2   0  -         2   0  -          2   0  -
  3     3   0  -         3   0  -         3   0  -          3   0  -
  4     4   0  -         4   0  -         4   1  0          4   0  -
  5     5   0  -         5   0  -         4   1  1          5   0  -
  6     6   0  -         6   0  -         5   1  0          6   0  -
  7     7   0  -         7   0  -         5   1  1          7   0  -
  8     8   0  -         8   0  -         6   2  00         8   1  0
  9     9   0  -         9   0  -         6   2  01         9   1  0
  10    10  0  -         10  0  -         6   2  10         10  1  0
  11    11  0  -         11  0  -         6   2  11         11  1  0
  12    12  0  -         12  0  -         7   2  00         8   1  1
  13    13  0  -         13  0  -         7   2  01         9   1  1
  14    14  0  -         14  0  -         7   2  10         10  1  1
  15    15  0  -         15  0  -         7   2  11         11  1  1
  16    16  2  00        16  2  00        8   3  000        12  2  00
  17    16  2  01        17  2  00        8   3  001        13  2  00
  22    17  2  10        16  2  11        8   3  110        14  2  01
  23    17  2  11        17  2  11        8   3  111        15  2  01
  24    18  2  00        18  2  00        9   3  000        12  2  10
  25    18  2  01        19  2  00        9   3  001        13  2  10
  26    18  2  10        18  2  01        9   3  010        14  2  10
  27    18  2  11        19  2  01        9   3  011        15  2  10
  28    19  2  00        18  2  10        9   3  100        12  2  11
  29    19  2  01        19  2  10        9   3  101        13  2  11
  30    19  2  10        18  2  11        9   3  110        14  2  11
  31    19  2  11        19  2  11        9   3  111        15  2  11
  32    20  3  000       20  3  000       10  4  0000       16  3  000
  33    20  3  001       21  3  000       10  4  0001       17  3  000
  63    23  3  111       23  3  111       11  4  1111       19  3  111
  64    24  4  0000      24  4  0000      12  5  00000      20  4  0000
  65    24  4  0001      25  4  0000      12  5  00001      21  4  0000
  127   27  4  1111      27  4  1111      13  5  11111      23  4  1111
  128   28  5  00000     28  5  00000     14  6  000000     24  5  00000

Thus, the probability of the value s corresponding to integers in $I_{a,b}$ will be

$$\sum_{v \in I_{a,b}} p_v$$

It follows that encoding an integer in $I_{a,b}$ will use a total number of bits given by

$$a - i - \log \sum_{v \in I_{a,b}} p_v$$

For values below $2^k$, s is sufficient for encoding the integer, so the number of used bits will be $-\log p_v$.

Overall, the expected number of bits used by Hybrid Integer Encoding is given by

$$B = \sum_{v < 2^k} -p_v \log p_v + \sum_{a \geq k}\ \sum_{0 \leq b < 2^i}\ \sum_{v \in I_{a,b}} p_v \left(a - i - \log \sum_{v' \in I_{a,b}} p_{v'}\right) \leq \sum_{v < 2^k} -p_v \log p_v + \sum_{v \geq 2^k} -r\, p_v \log p_v \leq r \sum_v -p_v \log p_v$$

which proves the theorem.

Corollary 5. For $j > 0$, the redundancy can be bounded by the maximum redundancy of $2^j$ streams of i.i.d. variables, each coded with a Hybrid Integer Code with parameters $k - j$, i, 0 and containing the integers equal to $0, \ldots, 2^j - 1$ modulo $2^j$.

Using Theorem 4, we can give an upper bound for the redundancy of Hybrid Integer Encoding on a geometric distribution with parameter $0 < \alpha < 1$. We recall that in a geometric distribution $p_v = \alpha^v(1-\alpha)$. We first compute

$$\sum_{v \in I_{a,b}} p_v = \sum_{v \in I_{a,b}} \alpha^v (1-\alpha) = (1-\alpha)\,\alpha^{2^a + 2^{a-i}b} \sum_{v < 2^{a-i}} \alpha^v = \left(1 - \alpha^{2^{a-i}}\right)\alpha^{2^a + 2^{a-i}b}$$

As $p_v$ is decreasing, the maximum over $I_{a,b}$ is achieved at $v = 2^a + 2^{a-i}b$.

Thus, we can rewrite the formula in Theorem 4 as

$$\frac{a - i - (2^a + 2^{a-i}b)\log\alpha - \log\left(1 - \alpha^{2^{a-i}}\right)}{-(2^a + 2^{a-i}b)\log\alpha - \log(1-\alpha)}$$

Observing that $\log(1-\alpha) \leq \log\left(1 - \alpha^{2^{a-i}}\right) \leq 0$, and that $\log\alpha < 0$, thus making the denominator minimum when $b = 0$, we can bound the given formula from above by

$$1 + \frac{(a-i)\,2^{-a}}{\log\frac{1}{\alpha}}$$

To compute the maximum of this value, we first differentiate $(a-i)2^{-a}$, observing that its maximum occurs at $a = i + \log e$. Thus, the maximum over integers will occur either for $a = i+1$ or $a = i+2$, which both yield $2^{-(i+1)}$; moreover, as a increases beyond $i+2$, the value of $(a-i)2^{-a}$ decreases. As $a \geq k$, we can state the following:

Theorem 6. The redundancy of Hybrid Integer Encoding with parameters k, i, 0 when encoding a geometric distribution with parameter α is at most

$$r_{\alpha,k,i} \leq \begin{cases} 1 + \dfrac{2^{-(i+1)}}{\log\frac{1}{\alpha}} & \text{if } k \leq i+2 \\[1ex] 1 + \dfrac{(k-i)\,2^{-k}}{\log\frac{1}{\alpha}} & \text{otherwise} \end{cases}$$

We now consider the case of a generic decreasing probability distribution. If we define $\hat{p}_{a,b} = p_{2^a + 2^{a-i}b}$ and $\check{p}_{a,b} = p_{2^a + 2^{a-i}(b+1) - 1}$, it follows that

$$\sum_{v \in I_{a,b}} p_v \geq 2^{a-i}\,\check{p}_{a,b}$$

Thus, we can state:

Theorem 7. The redundancy of Hybrid Integer Encoding with parameters k, i, 0 when encoding a decreasing distribution is at most

$$\max_{\substack{a \geq k \\ 0 \leq b < 2^i}} \frac{\log \check{p}_{a,b}}{\log \hat{p}_{a,b}}$$

Applying this theorem, we can also obtain a bound for Zipf distributions:

Theorem 8. The redundancy of Hybrid Integer Encoding with parameters k, i, 0 when encoding a Zipf distribution with parameter α is at most

$$r_{\alpha,k,i} \leq 1 + \frac{\log\left(1 + 2^{-i}\right)}{k + \frac{\log\zeta(\alpha)}{\alpha}}$$

Proof. We have $\hat{p}_{a,b} = (2^a + 2^{a-i}b)^{-\alpha}\zeta(\alpha)^{-1}$ and $\check{p}_{a,b} = (2^a + 2^{a-i}(b+1) - 1)^{-\alpha}\zeta(\alpha)^{-1}$. Thus,

$$\max_{\substack{a \geq k \\ 0 \leq b < 2^i}} \frac{\log \check{p}_{a,b}}{\log \hat{p}_{a,b}} = \max_{\substack{a \geq k \\ 0 \leq b < 2^i}} \frac{\alpha\log\left(2^a + 2^{a-i}(b+1) - 1\right) + \log\zeta(\alpha)}{\alpha\log\left(2^a + 2^{a-i}b\right) + \log\zeta(\alpha)}$$
$$= 1 + \max_{\substack{a \geq k \\ 0 \leq b < 2^i}} \frac{\log\frac{2^a + 2^{a-i}(b+1) - 1}{2^a + 2^{a-i}b}}{\log\left(2^a + 2^{a-i}b\right) + \frac{\log\zeta(\alpha)}{\alpha}} = 1 + \max_{\substack{a \geq k \\ 0 \leq b < 2^i}} \frac{\log\left(1 + \frac{2^{a-i} - 1}{2^a + 2^{a-i}b}\right)}{\log\left(2^a + 2^{a-i}b\right) + \frac{\log\zeta(\alpha)}{\alpha}}$$
$$= 1 + \max_{a \geq k} \frac{\log\left(1 + \frac{2^{a-i} - 1}{2^a}\right)}{\log 2^a + \frac{\log\zeta(\alpha)}{\alpha}} = 1 + \max_{a \geq k} \frac{\log\left(1 + 2^{-i} - 2^{-a}\right)}{a + \frac{\log\zeta(\alpha)}{\alpha}} \leq 1 + \frac{\log\left(1 + 2^{-i}\right)}{k + \frac{\log\zeta(\alpha)}{\alpha}}$$

We remark that the bounds obtained here are not especially tight: for example, for $k = 4$, $i = 2$ and $\alpha = 2$, they predict a redundancy of about 7.3%, while the results of the experimental evaluation show an actual redundancy of about 0.6%. Nonetheless, these bounds serve the purpose of showing that it is possible to achieve redundancy below any constant greater than 1 with the Hybrid Integer Encoding scheme for these two distributions.

3.3 Experimental results

We now report the results of an experimental comparison of the Hybrid Integer Encoding scheme on a sequence of $10^7$ i.i.d. integers with geometric (Figures 3.1, 3.2 and 3.3) and Zipf (Figures 3.4, 3.5 and 3.6) distributions respectively, compared with Rice, ζ, π and Elias γ and δ codes. The distributions were cut at the threshold of $2^{32}$, i.e. no number higher than $2^{32}$ was generated.

Figure 3.1: Redundancy of Hybrid Integer Encoding, compared to Rice codes, on a geometric distribution for various values of p.

Figure 3.2: Redundancy of Hybrid Integer Encoding, compared to ζ and Elias codes, on a geometric distribution for various values of p.

Figure 3.3: Redundancy of Hybrid Integer Encoding, compared to π and Elias codes, on a geometric distribution for various values of p.

Figure 3.4: Redundancy of Hybrid Integer Encoding, compared to Rice codes, on a Zipf distribution for various values of α.

Figure 3.5: Redundancy of Hybrid Integer Encoding, compared to ζ and Elias codes, on a Zipf distribution for various values of α.

Figure 3.6: Redundancy of Hybrid Integer Encoding, compared to π and Elias codes, on a Zipf distribution for various values of α.

The Hybrid Integer Encoding scheme was combined with either Huffman coding or ANS coding for encoding the symbols, using as parameters $k = 4$, $i = 2$, $j = 0$.

Hybrid Integer Encoding combined with ANS always has redundancy extremely close to 1, thanks to the possibility of using fractional bits per symbol.

On the other hand, Huffman-based Hybrid Integer Encoding can have higher redundancies for very skewed distributions, but it always has the lowest redundancy of all the encoding schemes that use an integer number of bits per value. In particular, it always matches unary and Rice coding on those geometric distributions where those codes are optimal.

We also measured the decoding speed of hybrid integer encoding, comparing against the γ and δ coding implementations in https://github.com/jacobratkiewicz/webgraph. We found that the extra computation required does not significantly affect decoding speed, as for all the integer coding methods the single-threaded decoding speed is between 6 and 8 nanoseconds per integer on a 32-core AMD 3970X CPU (with SMT enabled) with 256GB of RAM.

A possible implementation of Hybrid Integer Coding is provided in Figure 3.7. This implementation assumes that classes for reading and writing both raw and entropy-coded bits are available.

It can easily be seen that the decoding process just requires simple arithmetic, shifts and bit reading, which are readily available on most modern processors. On more constrained environments, such as low-powered microcontrollers that do not have a barrel shifter, we expect that implementing hybrid integer decoding might be more problematic; nonetheless, such microcontrollers are becoming less and less common.

As for the implementation of the encoder, the situation is similar to the decoder, except for the additional required FloorLog2 operation; this operation is once again readily available on most modern CPUs (in particular x86 and arm), and can be easily emulated otherwise.

Finally, we compare Hybrid Integer Coding with the method proposed in [78]. For this comparison, we will use base 16 and we will encode just the most significant digit in base 16 using entropy coding, as is done in [78]. This corresponds to, on average, encoding ≈2.26 bits below the most significant bit with entropy coding (0 bits when the first digit is 1, 1 bit for 2–3, 2 bits for 4–7 and 3 bits for 8–15); accordingly, the closest corresponding Hybrid Integer Coding setting is 4, 2, 0, which fully entropy encodes numbers up to 16 and entropy codes the highest two bits after the most significant one for larger numbers.


// Assumes BitWriter/BitReader classes providing raw-bit and entropy-coded I/O.
class HybridIntegerCoder {
 public:
  void Encode(uint32_t value, BitWriter* writer) const {
    if (value < (1u << k)) {
      // Small values are entropy-coded directly.
      writer->EntropyEncode(value);
    } else {
      uint32_t n = FloorLog2(value);                // n = p - 1
      uint32_t m = value - (1u << n);               // strip the leading 1 bit
      uint32_t high_bits = m >> (n - i);            // the i bits after the leading 1
      uint32_t low_bits = m & ((1u << j) - 1);      // the lowest j bits
      uint32_t token = (1u << k) + ((n - k) << (i + j)) +
                       (high_bits << j) + low_bits;
      uint32_t nbits = n - i - j;                   // number of raw bits of t
      uint32_t bits = (value >> j) & ((1u << nbits) - 1);
      writer->EntropyEncode(token);
      writer->Write(nbits, bits);
    }
  }

  uint32_t Decode(BitReader* br) const {
    uint32_t token = br->EntropyDecode();
    if (token < (1u << k)) return token;
    uint32_t nbits = k - (i + j) + ((token - (1u << k)) >> (i + j));
    uint32_t low_bits = token & ((1u << j) - 1);
    token >>= j;
    uint32_t bits = br->ReadBits(nbits);
    uint32_t high_bits = (1u << i) | (token & ((1u << i) - 1));
    return (((high_bits << nbits) | bits) << j) | low_bits;
  }

  uint32_t k = 4;
  uint32_t i = 2;
  uint32_t j = 0;
};

Figure 3.7: A possible implementation of Hybrid Integer Coding.


Results are reported in Figures 3.8 and 3.9. The two schemes result in very similar performance; however, Hybrid Integer Coding has better performance (up to 0.6–0.8% better) on geometric distributions with small p, and worse performance (up to 0.3%) for Zipf distributions with very large α, where the effect of the larger average number of entropy coded bits is more pronounced.

Figure 3.8: Ratio of encoded size between Hybrid Integer Encoding and the method proposed in [78], on a geometric distribution for various values of p. Values smaller than 1 correspond to a smaller size for Hybrid Integer Coding.

Figure 3.9: Ratio of encoded size between Hybrid Integer Encoding and the method proposed in [78], on a Zipf distribution for various values of α. Values smaller than 1 correspond to a smaller size for Hybrid Integer Coding.

CHAPTER 4

NOVEL TECHNIQUES FOR HIGH-ORDER ENTROPY CODING

TO MAKE HIGH-ORDER ENTROPY CODING MORE PRACTICAL, we propose two techniques which, to the best of our knowledge, have not been explored in the literature: context clustering and decision-tree-based context modeling.

In this chapter, we will denote by context space ($\mathcal{C}$) the space of all the possible contexts for a symbol, by $\Sigma$ the set of all possible symbols, by $\mathcal{P}_\Sigma$ the set of all the probability distributions on $\Sigma$, and by $c(s)$ the context associated with a given symbol s.

We will also make the simplifying assumption that encoding the probability of a given symbol requires a fixed number of bits b, and thus encoding a probability distribution requires $|\Sigma| \cdot b$ bits. In most real-world encoding schemes this is not the case, but the rest of the chapter is substantially unchanged even with more complex distribution cost models; thus, for simplicity of exposition, we will use this trivial cost model.

The problem of communicating the probability distributions of each symbol can be seen as the problem of encoding a function $d : \mathcal{C} \to \mathcal{P}_\Sigma$; in the most general case, corresponding to trivial order-k entropy coding, encoding d requires $O(|\mathcal{C}| \cdot |\Sigma| \cdot b)$ bits. In the example from Section 2.5, order-2 entropy coding of a binary file, $\Sigma = \{0, \ldots, 255\}$, $\mathcal{C} = \Sigma^2$ and $c(s)$ is the pair formed by the two bytes preceding s. Assuming $b = 12$, as is often the case when using arithmetic coding, the cost of transmitting the distributions in the most trivial way would be approximately $|\Sigma|^2 \cdot |\Sigma| \cdot 12 = 201\,326\,592$ bits, about 25 megabytes, making order-2 entropy coding unfeasible for files that are not extremely large.

Figure 4.1: An example of histograms for the two context values a and b on the alphabet {0, 1, 2}: (a) h(a), H(a) ≈ 19.30; (b) h(b), H(b) ≈ 17.35; (c) h(a, b), H(a, b) ≈ 42.29.

4.1 Context clustering

The idea behind context clustering is simple: as distributions are expensive to encode, we can reduce the number of distributions to encode by splitting d into two parts:

$$cl : \mathcal{C} \to \{0, \ldots, k-1\} \qquad\qquad dc : \{0, \ldots, k-1\} \to \mathcal{P}_\Sigma$$


Algorithm 2: Context clustering algorithm. Returns an array of context clusters.

    $c_m \leftarrow c \in \mathcal{C}$ such that $H(c)$ is maximum;
    $S \leftarrow \{c_m\}$;
    while $|S| < \min(k, |\mathcal{C}|)$ do
        $d_x \leftarrow \min_{c \in S} D(h(c), h(x))$ for each $x \in \mathcal{C} \setminus S$;
        $c_m \leftarrow c \in \mathcal{C} \setminus S$ such that $d_c$ is maximum;
        $S \leftarrow \{c_m\} \cup S$;
    $G \leftarrow [\{c\}$ for $c \in S]$;
    foreach $c \in \mathcal{C} \setminus S$ do
        $i \leftarrow$ index such that $D(h(c), h(G[i]))$ is minimum;
        $G[i] \leftarrow G[i] \cup \{c\}$;
    return G;

The values $0, \ldots, k-1$ effectively correspond to a cluster of context values, and distributions are assigned to clusters instead of specific context values. It is easy to see that the cl function can be encoded using $O(|\mathcal{C}| \cdot \log k)$ bits, and the dc function requires $O(k \cdot |\Sigma| \cdot b)$ bits. Thus, this scheme achieves a significant reduction of distribution cost when $|\mathcal{C}|$ is large (as typically $\log k \ll |\Sigma| \cdot b$).

We will denote with $h(c)$, for $c \in \mathcal{C}$, the histogram of values x such that $c(x) = c$. We denote by $h(a_1, \ldots, a_n)$ the histogram obtained by pointwise adding the counts of all the symbols in $h(a_1), \ldots, h(a_n)$; finally, we will denote by $H(a_1, \ldots, a_n)$ the cost in bits of encoding all the symbols that have $a_1, \ldots, a_n$ as a context, i.e. the entropy of the distribution defined by $h(a_1, \ldots, a_n)$ multiplied by its total symbol count.

As an example, Figure 4.1 shows histograms for two distinct context values a, b on a 3-symbol alphabet. We have that

$$H(a) = 8\log\frac{14}{8} + 4\log\frac{14}{4} + 2\log\frac{14}{2} \approx 19.30$$
$$H(b) = 2\log\frac{13}{2} + 3\log\frac{13}{3} + 8\log\frac{13}{8} \approx 17.35$$
$$H(a, b) = 10\log\frac{27}{10} + 7\log\frac{27}{7} + 10\log\frac{27}{10} \approx 42.29$$

Once cl is known, how to choose dc is obvious: it is simply, for each cluster, the histogram obtained by merging the histograms that belong to the cluster.

4.1.1 Heuristic algorithm

We will now present a heuristic algorithm to choose cl for a fixed value of k, with running time $O(|\Sigma| \cdot |\mathcal{C}| \cdot k)$.

We first define a distance between two histograms $h(A)$, $h(B)$ as follows:

$$D(h(A), h(B)) = H(A, B) - H(A) - H(B) \geq 0$$

Clearly, D can be computed in $O(|\Sigma|)$ time. From the example above, we can see that $D(h(a), h(b)) = H(a, b) - H(a) - H(b) \approx 5.64$.

The algorithm, detailed in Algorithm 2, proceeds with the following steps:

1. Pick the element $c_1$ from $\mathcal{C}$ such that $H(c_1)$ is maximum.

2. Pick the element $c_i$ from $\mathcal{C} \setminus \{c_1, \ldots, c_{i-1}\}$ such that $\min_{j<i} D(h(c_i), h(c_j))$ is maximum.

3. Repeat the previous step until k elements have been chosen.

4. Assign each element $c \in \mathcal{C}$ to cluster i if $D(c, c_i)$ is minimum among the choices of i; replace $c_i$ with $c_i \cup \{c\}$.

As described, the second step of the algorithm needs to evaluate the function D $O(k^2|\mathcal{C}|)$ times. However, as the minimum is computed over an increasing set that has maximum size k, it is easy to see that only $O(k|\mathcal{C}|)$ evaluations are necessary. Indeed, we can keep track, for every $c \in \mathcal{C}$, of the minimum of $D(h(c), h(c_i))$ over the selected $c_i$, and compute the distance of every $c \in \mathcal{C}$ from each newly selected histogram, updating the minimum if necessary.
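A compact C++ sketch of this heuristic follows (ours; histograms are plain count vectors, the distance D is recomputed on demand instead of being cached as described above, and the final assignment uses the seed histograms directly, omitting the incremental merging of step 4):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

using Histogram = std::vector<double>;  // symbol counts for one context value

// H(h): entropy of the empirical distribution times its total symbol count.
double Cost(const Histogram& h) {
  double total = 0, cost = 0;
  for (double c : h) total += c;
  for (double c : h) if (c > 0) cost += c * std::log2(total / c);
  return cost;
}

Histogram Merge(const Histogram& a, const Histogram& b) {
  Histogram m(a);
  for (size_t i = 0; i < b.size(); ++i) m[i] += b[i];
  return m;
}

// D(h(A), h(B)) = H(A, B) - H(A) - H(B) >= 0.
double Distance(const Histogram& a, const Histogram& b) {
  return Cost(Merge(a, b)) - Cost(a) - Cost(b);
}

// Greedy clustering: pick k seeds that are far apart (first the histogram of
// maximum cost, then repeatedly the one maximizing the distance to the closest
// seed), then assign every context value to its closest seed.
std::vector<int> ClusterContexts(const std::vector<Histogram>& hists, int k) {
  const int n = (int)hists.size();
  std::vector<int> seeds;
  std::vector<bool> is_seed(n, false);
  int best = 0;
  for (int c = 1; c < n; ++c) if (Cost(hists[c]) > Cost(hists[best])) best = c;
  seeds.push_back(best); is_seed[best] = true;
  while ((int)seeds.size() < std::min(k, n)) {
    int far = -1; double far_d = -1;
    for (int c = 0; c < n; ++c) {
      if (is_seed[c]) continue;
      double d = Distance(hists[c], hists[seeds[0]]);
      for (size_t s = 1; s < seeds.size(); ++s)
        d = std::min(d, Distance(hists[c], hists[seeds[s]]));
      if (d > far_d) { far_d = d; far = c; }
    }
    seeds.push_back(far); is_seed[far] = true;
  }
  std::vector<int> cluster(n);
  for (int c = 0; c < n; ++c) {
    int arg = 0; double best_d = Distance(hists[c], hists[seeds[0]]);
    for (size_t s = 1; s < seeds.size(); ++s) {
      double d = Distance(hists[c], hists[seeds[s]]);
      if (d < best_d) { best_d = d; arg = (int)s; }
    }
    cluster[c] = arg;
  }
  return cluster;
}

int main() {
  // The two context values of Figure 4.1 plus a third one similar to the first.
  std::vector<Histogram> hists = {{8, 4, 2}, {2, 3, 8}, {7, 5, 1}};
  for (int c : ClusterContexts(hists, 2)) std::cout << c << " ";
  std::cout << "\n";  // prints "0 1 0": the first and third contexts share a cluster
}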

The algorithm described here is reminiscent of a deterministic version of the initialization step of k-means++ [6]. However, the D function is not one of the common distance metrics ($\|x - y\|_\ell$) that are known to guarantee good results for k-means++, and the problem being solved is not quite equivalent to k-means clustering. Nonetheless, given the similarities of the two problems, it seems reasonable to state the following conjectures:

Conjecture 9. It is NP-complete to decide whether there exist two functions cl and dc such that:

• The cost of encoding the input text using $(dc \circ cl)(c(x))$ as the distribution for encoding x is below a threshold t.


Figure 4.2: An example of a context tree CT. Dashed arrows correspond to branches that are taken if the condition is false; $x_n$ corresponds to the value of the n-th dimension of the context value. Hence, the root of the tree in the figure will choose the left child if dimension 2 of the context value is strictly positive, and the right child otherwise.
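To make the lookup semantics described in the caption of Figure 4.2 concrete, here is a minimal sketch (ours; names and layout are illustrative) of evaluating such a context tree on a context value $(x_1, \ldots, x_n)$: each internal node compares one dimension against a threshold, and leaves store the resulting context id.

#include <iostream>
#include <vector>

// One node of a context tree. Internal nodes test context[dimension] > threshold
// and go to the left child if true, to the right child otherwise (as in the
// solid/dashed arrows of Figure 4.2); leaves carry a context id.
struct CtxTreeNode {
  int dimension;    // -1 marks a leaf
  int threshold;
  int left, right;  // child indices in the node vector
  int context_id;   // valid only for leaves
};

int LookupContext(const std::vector<CtxTreeNode>& tree,
                  const std::vector<int>& context) {
  int node = 0;  // root
  while (tree[node].dimension >= 0) {
    bool taken = context[tree[node].dimension] > tree[node].threshold;
    node = taken ? tree[node].left : tree[node].right;
  }
  return tree[node].context_id;
}

int main() {
  // A tiny two-level tree: the root tests x2 > 0, its children are leaves.
  std::vector<CtxTreeNode> tree = {
      {1, 0, 1, 2, -1},    // node 0: is x2 > 0?  (0-based dimension index 1 = x2)
      {-1, 0, -1, -1, 1},  // node 1: leaf, context 1
      {-1, 0, -1, -1, 0},  // node 2: leaf, context 0
  };
  std::cout << LookupContext(tree, {5, 3}) << "\n";   // x2 = 3 > 0  -> context 1
  std::cout << LookupContext(tree, {5, -2}) << "\n";  // x2 = -2     -> context 0
}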

Conjecture 10. There is a randomized scheme for selecting a new element from $\mathcal{C}$ that achieves in expectation an $O(\log k)$ approximation of the true optimum of context clustering.

The second conjecture follows from the fact that the initialization step of k-means++ is an $O(\log k)$ approximation, as shown in [6].

4.2 Decision-tree-based context modeling

When $\mathcal{C}$ is very large, encoding the cl function can still be prohibitively expensive. We now present another representation of the cl function that has the interesting characteristic of not requiring space proportional to $|\mathcal{C}|$.

This technique assumes, as is often the case, that $\mathcal{C}$ is a product of intervals of integers $I_1 \times \cdots \times I_n$; this then produces a decomposition of $c \in \mathcal{C}$ given by $c = (x_1, \ldots, x_n)$.

We define a binary tree CT where:
