
UNIVERSITÀ DI PISA
DIPARTIMENTO DI INFORMATICA

DOTTORATO DI RICERCA IN INFORMATICA
SETTORE SCIENTIFICO DISCIPLINARE INF/01

PH.D. THESIS

On the use of optimization techniques for designing novel data compression and indexing schemes

Andrea Farruggia

SUPERVISOR: Paolo Ferragina
SUPERVISOR: Rossano Venturini
REVIEWER: Gad M. Landau
REVIEWER: Gonzalo Navarro

October 2016


Contents

Introduction

1 Preliminaries
  1.1 Models of computation
    1.1.1 RAM model
    1.1.2 Modeling modern computer hierarchies
    1.1.3 Models of computation for memory hierarchies
  1.2 Data compression
    1.2.1 Sources and (empirical) entropy
    1.2.2 Coding
    1.2.3 Dictionary compression

2 On Space-Optimal Compression
  2.1 A review of the Bit-Optimal LZ77 parsing strategy
    2.1.1 A graph-based approach
  2.2 An Efficient Bit-Optimal Algorithm
    2.2.1 Computing maximal positions
    2.2.2 Restricted Suffix Array Construction

3 On optimal data-compression trade-offs
  3.1 Modeling decompression time
  3.2 Pathological strings: trade-offs in space and time
  3.3 Bicriteria Data Compression
    3.3.1 Overview of the algorithm
    3.3.2 First phase: the cutting-plane algorithm
    3.3.3 Second phase: the path-swapping algorithm

4 Experimental evaluation of an engineered optimal compressor
  4.1 Experimental Setting
    4.2.1 Allowing literal strings
    4.2.2 Results
  4.3 Bicriteria compressor
    4.3.1 Decompression time model
    4.3.2 Algorithmic improvements
    4.3.3 Overall comparison

5 A compact relative index
  5.1 Preliminaries
    5.1.1 RLZ
    5.1.2 GDC
    5.1.3 Relative pointers
    5.1.4 Relative data structures
  5.2 A new RLZ representation
    5.2.1 Parsing Algorithm
    5.2.2 Supporting fast random access
  5.3 Experimental results
    5.3.1 Calibration
    5.3.2 Comparison with state-of-the-art

6 Future work


Introduction

The last few years have seen an exponential increase in the amount of data that must be stored, accessed and analyzed, driven by many disparate fields such as big data analytics, genomic technologies and even high-energy physics. The possibility offered to users by social media networks of easily creating and publishing content has given rise to platforms managing hundreds of petabytes, while the appearance of sequencers capable of producing terabytes of data at an inexpensive price, made possible by advancements in DNA sequencing technology, has opened the road to fields such as pan-genomic analysis, where common motifs have to be found in hundreds of genomes that must first be sequenced and indexed. Many of these applications, which operate on the data in an on-line fashion, have strict performance requirements. For example, web indexes must be capable of retrieving any indexed webpage in less than a millisecond on average in order to meet operational standards. This is an incentive to store as much data as possible in main memory, instead of secondary storage like hard disks or solid-state disks. In fact, since accessing RAM has a latency of around 100 nanoseconds (100 · 10^-9 seconds), while accessing a disk has a latency of a few milliseconds (∼ 10^-3 seconds), even allowing just 1% of the memory references to be read from disk implies a slowdown of a factor of about 100.

The most effective way of fitting more data in memory is to use data compression techniques. However, because these applications operate in an on-line fashion, such techniques must allow operating over the data with an efficiency comparable to that of storing the data in uncompressed format. The kind of techniques that must be used to achieve good compression ratios and fast data access depends on the way the data is accessed and on the kind of operations that must be performed on it. Applications can be grouped into two broad categories, depending on the nature of these operational requirements:

1. applications that organize data into files that must be accessed in their entirety. An example of this pattern is a high-performance distributed storage system like Google's BigTable [CDG+08], a distributed key-value database where each value is composed of chunks of 64 MiB of data. Applications in this scenario, also known as "compress once, decompress many times", need a good trade-off between compression ratio and decompression speed.


2. applications that need to perform sophisticated queries on their dataset. Examples of this class of applications are web indexes, where a search query needs to fetch only those documents containing a user-supplied list of words, or DNA aligners, where portions of the text that match a given pattern must be found. In these applications only a small fraction of the data is accessed, so the compressed representation should allow for some kind of direct access without requiring decompression of the whole text, which is costly.

Theoretical solutions for these needs abound in the literature. For "compress once, decompress many times" scenarios, there are compressors based either on the Lempel-Ziv parsing scheme [ZL77; ZL78] or on the Burrows-Wheeler Transform [BW94] that are asymptotically optimal both in time and space, while for the second category Compressed Full-Text Indexes [NM07] such as the FM-index [FM05] allow powerful queries on the compressed representation with almost optimal time complexities. However, new applications pose plenty of new challenges, such as the need for specific time/space trade-offs that are not provided by existing solutions, or the necessity of new, compact storage systems that exploit the peculiarities of new datasets to support fast queries on them. In this thesis, our contributions focus on developing tools and techniques that can help address these new challenges.

Efficient and usable space-optimal compressor (Chapter 2). Even though the LZ77 algorithm described by Ziv and Lempel [ZL77] is asymptotically optimal in a strong information-theoretic sense, it does not necessarily yield the lowest compression ratio achievable for every possible string. In fact, an active line of research, both in industry and in academia, has focused on improving the compression ratio attainable by an LZ77 scheme on each individual string. Ferragina, Nitto, and Venturini [FNV13] introduced the bit-optimal LZ77 parsing problem, the problem of constructing the most succinct LZ77 compression of any given text, and illustrated an efficient construction algorithm. Their algorithm assumes that universal codes are used for compressing the LZ77 phrases in the parsing (see Section 1.2.3 for an introduction to LZ77 parsing). The algorithm illustrated in their work, albeit theoretically efficient when specific coders are used, is not practical because it involves sophisticated transformations on cache-unfriendly data structures, so it is difficult to implement and likely slow in practice. In this chapter we show a practical, memory-friendly algorithm that matches the time and space complexity of the original solution. This chapter is partially based on paper [FFV14b].

Bicriteria data compressor (Chapter 3). In industry, many data compressors have been developed to trade, in a very specific way, compression ratio for decompression speed. Unfortunately, these approaches are ad-hoc and thus not readily applicable to other application scenarios, a situation that has led to the development of myriads of different data compressors, each targeting a different performance profile. In this chapter we address this issue by introducing the Bicriteria Data Compression problem, which asks for the most succinct LZ77 compression that can be decompressed within a given time budget, or vice versa. As in the bit-optimal LZ77 parsing problem, we assume that phrases are compressed using universal codes. The problem is modeled as a Weight-Constrained Shortest Path Problem on a directed acyclic graph. We then show an algorithm that solves the problem efficiently by exploiting some peculiarities of that graph. Nicely, the solution reduces the problem to the resolution of a small number of instances of the bit-optimal LZ77 problem, which in turn implies the existence of an efficient and practical construction algorithm. This chapter is partially based on paper [FFV14a].

An efficient, engineered data compressor (Chapter 4). In this chapter we illustrate Bc-Zip, an engineered implementation of the efficient bit-optimal data compressor introduced in Chapter 2 and of the bicriteria data compressor introduced in Chapter 3. We illustrate some algorithmic improvements that aim at making the algorithm even more efficient in practice. We also show that conventional means of providing decompression time/compression ratio trade-offs in LZ77 compressors are not consistent and sometimes even detrimental, a result which further validates our approach. The benchmarks show that the proposed compressor is extremely competitive with the state-of-the-art when compression ratio and decompression speed are considered together. This chapter is based on paper [FFV14b].

Succinct index for relative data structures (Chapter 5). In biological applications, there is a need to index a great number of documents that are very similar to each other. Because of this, a new line of research has focused on building relative compressed indexes, that is, compressed indexes that are expressed relative to the index of a similar document in order to save space. These approaches exploit the similarities of the data structures underlying these indexes to lower their total space consumption. For example, the Relative FM-Index of Belazzougui et al. [BGG+14] exploits the fact that the BWTs of two similar documents are similar, and thus expresses a BWT as its difference from that of the reference. In this chapter we propose a new relative Lempel-Ziv representation that can compactly represent all differences among similar documents, and we show a very efficient implementation supporting fast random access to the (relatively) compressed input text. Being a general solution, this approach can be used to help design new relative data structures in similar settings. This chapter is based on paper [CFG+16].


1 Preliminaries

In this chapter we illustrate some prerequisite topics to this work. We first describe some models of computation that are useful in assessing the efficiency of the algorithms illustrated in this thesis or in guiding the design of models that can accurately predict execution times. We then focus on some fundamental concepts in data compression, the main topic of this thesis.

Basic notation

Logarithms are always in base 2, so log x must be interpreted as log_2 x. For a sequence A = a_1, ..., a_n, let A[i] = a_i denote the i-th element of A, while A[i, j] denotes the subsequence of A from i to j, that is, A[i, j] = a_i a_{i+1} ... a_j. We denote with Pref_A(i) the prefix A[1, i] and with Suf_A(i) the suffix A[i, n]. We drop the name of the sequence when it can be unambiguously inferred from the context. Finally, we make use of the notations {x_i}_{i ∈ S} and {x_i}_{i=1}^n to succinctly denote the sets {x_i | i ∈ S} and {x_1, ..., x_n}, respectively.

1.1 Models of computation

A model of computation estimates the amount of time and memory required to solve different instances of a problem. It consists of an abstract machine specified in terms of how memory is organized, how it is accessed, what operations can be performed on it, and the execution time of each operation. A good model of computation is simple, to allow the algorithm designer to carry out a precise, analytical analysis of an algorithm against it, and universal, so that the analysis is meaningful across different computer architectures.

1.1.1 RAM model

The most widely adopted model of computation is the Random Access Machine (RAM) model. The RAM closely models the classical Von Neumann architecture: memory is organized as a table of words of w bits, each word directly accessible through its address, a unique integer. Each access to memory (either read or write) costs O(1) time.


A CPU is in charge of accessing words from memory and carrying out computations on them: each "basic" operation on words costs O(1) time. Different models differ in how they define the set of "basic" operations: throughout this thesis we assume that these operations are the four basic arithmetic operations (addition, subtraction, multiplication, division) and bit-wise operations like and, or, xor, and left/right bitshift.

The Transdichotomous RAM model is an extension of the RAM model introduced by Fredman and Willard for their Fusion Tree data structure [FW93]. The Transdichotomous RAM model extends the RAM by assuming that w = O(log n). This assumption models the fact that the maximum amount of memory that can be accessed by a real-world computer is limited by the highest memory reference that can be stored in a word, so a machine with a word size of w bits can access at most 2^w words of memory. The transdichotomous RAM model is thus an appropriate choice when modeling algorithms that store all their input and output in main memory. Notice that, because of the transdichotomous assumption, basic operations on integers of O(log n) bits take O(1) time, which is not true in the standard word RAM model or in other models of computation such as the Turing Machine. However, this models real-world CPU architectures closely, because ALUs (Arithmetic Logic Units, the CPU components in charge of executing arithmetic operations) exploit bit-parallelism to carry out operations on a whole word in constant time.

1.1.2 Modeling modern computer hierarchies

The RAM model is simple and universally applicable, as it models the Von Neumann architecture that is at the basis of nearly every modern computer architecture. Because of this, it is arguably the most popular model of computation for expressing the computational complexity of an algorithm. However, the RAM model makes two fundamental assumptions that do not hold in modern computer architectures, namely uniform time to execute an operation and uniform memory access times. Because of this, the RAM model sometimes assigns the same (asymptotic) complexity to algorithms with wildly different execution times in practical settings. In these cases, more sophisticated models of computation ought to be used. In the algorithmic community, attention has focused on conceiving models of computation that better capture the memory hierarchy. This is because accessing main memory is much slower than executing an arithmetic operation (∼ 100ns versus ∼ 1ns) and the gap is increasing with time [Dre07], so improving the memory access behavior has a great impact on running times. We now quickly illustrate some of the most important characteristics of the memory hierarchy from a modeling point of view, and then introduce in Section 1.1.3 more sophisticated models of computation that take these details into account; a comprehensive exposition of the intricacies of modern memory hierarchies has been given by Drepper [Dre07].

A modern memory hierarchy is composed of N levels C_1, C_2, ..., C_N = M, where the last (and largest) level is the main memory (M), while the N − 1 levels below it are usually called caches. Two consecutive levels C_i and C_{i+1} are connected by a bus which allows a bidirectional flow of information between the two levels. The first level C_1 is directly connected to the processor. Figure 1.1 provides a pictorial illustration of a memory hierarchy.

[Figure 1.1: Depiction of a memory hierarchy with N = 4 levels and cache line size L.]

The address space is logically partitioned into cache lines, blocks of memory of size L.¹ Cache lines are the unit of storage and data transfer between memory levels: a cache C_i cannot store fractional amounts of cache lines, and each bus transports entire cache lines at once. When the CPU needs to access an address x, it forwards the request to the first level C_1; if that level holds the line containing x (a cache hit), then it forwards x back to the processor. Otherwise (a cache miss), the cache C_1 (i) recursively requests from cache C_2 the cache line containing x; (ii) stores the line received from C_2; and (iii) replies x back to the processor. In case C_1 was full at the moment of requesting a cache line from C_2, it makes room for the incoming line by removing some lines selected through a cache eviction policy. The recursive chain of requests stops when a cache C_i that holds x is reached.

¹ In principle, each cache level C_i might logically partition the memory address space into lines of a distinct size. However, many modern architectures (such as Intel architectures dating back to the Pentium 4) use the same cache line size for all memory levels.

The design of memory hierarchies provides efficient memory access to algorithms that exhibit good spatial locality and temporal locality of reference:

Spatial locality. An algorithm exhibits good spatial locality of reference if it accesses addresses close to the last accessed addresses more frequently than addresses that are far apart. More precisely, an algorithm with good spatial locality makes a stream of memory requests that can be partitioned into sub-streams, such that each sub-stream is composed of long runs of memory requests where the next address in the sequence is close to the last one. Spatial locality exploits memory hierarchies well because memory transfers happen in blocks: accessing n memory addresses linearly, that is, sequentially requesting addresses x, x + 1, ..., x + n, is more efficient than accessing n random addresses, because in the first case only up to n/L line transfers from M must be performed, as opposed to up to n transfers in the latter case. Moreover, modern memory hierarchies optimize spatial locality even further through prefetching, which lowers even more the time needed to read chunks of memory: when the processor triggers a sequence of cache misses to contiguous cache lines, it instructs the cache to "stream" adjacent lines without an explicit processor request. After prefetching takes place, accessing subsequent characters from the cache C_i has the same cost as reading them from the closest (fastest) cache.

Temporal locality. As just mentioned, a cache eviction policy is in charge of selecting the cache lines to be evicted from a cache to make room for incoming cache lines from higher levels. The goal of a cache eviction policy is to maximize the cache hit rate, that is, the probability of already having in cache the lines requested in the future. This problem is also known as the paging problem. Usually, caches employ (variants of) the Least Recently Used (LRU) policy, which always evicts the cache line whose last access lies furthest in the past. This implies that an algorithm that tends to access recently accessed memory references is more efficient, because the probability of accessing a non-evicted cache line, and thus the hit rate, is higher.

An algorithm exhibits good locality of reference if it exhibits both temporal and spatial locality.
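To make the eviction mechanism concrete, the following is a minimal sketch of an LRU policy in Python; the class name, the capacity measured in whole lines, and the use of addresses as dictionary keys are illustrative assumptions of this example, not part of any specific architecture.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal sketch of the LRU eviction policy: on a miss with a full cache,
    the line whose last access lies furthest in the past is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of cache lines held
        self.lines = OrderedDict()     # ordered from least to most recently used

    def access(self, address):
        """Returns True on a cache hit, False on a miss (after loading the line)."""
        if address in self.lines:
            self.lines.move_to_end(address)    # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)     # evict the least recently used line
        self.lines[address] = True
        return False

cache = LRUCache(2)
print([cache.access(a) for a in (1, 2, 1, 3, 2)])  # [False, False, True, False, False]
```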

1.1.3 Models of computation for memory hierarchies

Here we briefly review some computational models that more closely model the performance of algorithms when executed on modern memory hierarchies.

Two-level memory model. The most widely known is probably the two-level memory model, introduced by Aggarwal and Vitter [AV88]. In this model memory is split in two: a main memory M and an external memory. Main memory is limited, external memory is unbounded. Main memory can be accessed directly and each access costs nothing, while the only allowed operation on data stored in external memory is a block transfer to/from main memory, paying O(1) disk accesses. The I/O complexity of an algorithm is thus the number of accesses to external memory it performs. This model captures the fact that the last level of the memory hierarchy has an access time much larger than those of the previous levels.

Hierarchical Memory Model. Aggarwal et al. [AAC+87] introduced the Hierarchical Memory Model (HMM). The HMM extends the RAM model by charging f(x) = ⌈log x⌉ units of time for accessing memory reference x. In other words, the HMM assumes memory is composed of an infinite number of memory levels, where each level C_i contains 2^{i−1} words of memory and has access time i.

Hierarchical Memory with Block Transfer. Aggarwal, Chandra, and Snir [ACS87] extend the HMM with a block transfer operation, where moving a block of ℓ contiguous words starting at reference x has the same cost as accessing x alone. The block transfer operation aims to assign lower complexity to algorithms with good spatial locality, as it approximates both blocked memory transfers between cache lines and prefetching.

Uniform Memory Hierarchy model. Alpern et al. [ACF+94] propose the Uniform Memory Hierarchy model (UMH), a sophisticated model where memory is composed of a sequence of increasingly bigger memories M_0, M_1, M_2, ...; each memory M_i is connected to memory M_{i+1} by means of a "bus". The processor can only directly access words in M_0, while data can flow between different memories through the buses. Memory transfers happen in blocks, with the block size between memories (M_i, M_{i+1}) growing with i. Each block takes some time f(i) to be transferred from M_i to M_{i−1}. All buses may work simultaneously. In this model, the efficiency of an algorithm is evaluated as the ratio between its RAM and UMH complexities. This model is fairly sophisticated and directly models blocked memory transfers, prefetching and cache eviction policies.

Logarithmic Pipelined Model. Luccio and Pagli [LP93] introduced the Logarithmic Pipelined Model (LPM). This model extends the model proposed by Aggarwal et al. [AAC+87] by allowing pipelining, i.e., sending multiple requests to memory without waiting for the completion of the previous ones. This captures the natural parallel implementation of modern memories, as well as features like write-back, where requests to write blocks to memory can be queued and pipelined.

These models will serve as a theoretical basis for the Bc-Zip decompression time model illustrated in Chapter 3.

Empirical Performance Models

The models of computation illustrated in the previous sections are formal and architecture-agnostic, and are thus appropriate for studying the (asymptotic) relationship between input size and execution time; as such, they are at the basis of computational complexity theory. A model of computation therefore gives a qualitative idea of the performance of an algorithm. In some contexts, such as computing execution plans [SR08] or in systems designed to select the most efficient algorithm on a per-instance basis [Ric76], a quantitative estimate of the execution time on a particular architecture is needed. In these situations, Empirical Performance Models (EPMs) are valuable tools. In an EPM, instead of formally analyzing an algorithm over an abstract machine that approximates concrete architectures, the approach consists of running an implementation of the algorithm on a target architecture and correlating some input features (such as input length) with the execution time, in order to devise a predictive model. The field of EPMs is vast and giving a comprehensive overview is beyond the scope of this section; the interested reader can consult the survey of Hutter et al. [HXH+14] for additional details. Here we quickly point out that the SPORE approach of Huang et al. [HJY+10], one of the most accurate general-purpose EPMs, based on machine-learning techniques, achieves a relative prediction error $|t_{act} - t_{est}|/t_{act}$ between the estimated time $t_{est}$ and the actual time $t_{act}$ of about 7% in their set of experiments.

1.2 Data compression

This section introduces the basic tools of data compression used throughout this thesis. Section 1.2.1 introduces the information-theoretic notions of Shannon entropy and empirical entropy, which will be used to show some optimality results in the section devoted to symbolwise coding (Section 1.2.2) and in the one on dictionary compression (Section 1.2.3).

1.2.1 Sources and (empirical) entropy

In a celebrated paper that founded the field of information theory, Shannon [Sha48] introduced the concept of entropy of an information source. Entropy is a fundamental notion in information theory because it gives a lower bound on the compression attainable under some general assumptions. In particular, Shannon studied some fundamental questions involving the following communication system. A source emits an infinite stream of symbols σ_1, σ_2, ... belonging to an alphabet Σ. An encoder encodes these symbols as they arrive into a stream of bits, such that a decoder is able to decode the original symbols back from the bitstream and finally forwards them to a destination. The system is depicted in Figure 1.2.

[Figure 1.2: A depiction of Shannon's communication system.]

Shannon studied the problem of giving a lower bound on the average number of bits per symbol emitted by the encoder such that the decoder is still able to recover the original symbols emitted by the source. Let us assume that the source emits symbol σ_i as the outcome of a random variable X_i taking values in Σ, where the random variables are independent and identically distributed with probability density function p(σ). The entropy of such a source is defined as

$$H = \sum_{\sigma \in \Sigma} p(\sigma) \log \frac{1}{p(\sigma)}$$

The term $\log \frac{1}{p(\sigma)}$ is sometimes denoted as the information content I(σ) of σ. Shannon proved that any encoder encoding such an infinite stream of symbols must require at least H bits per symbol on average.

This result holds only when the random variables X_1, X_2, ... are independent and identically distributed. Most of the time, however, this assumption is not realistic, since usually the probability of a symbol is determined by the symbols that precede it in the stream. Under the assumption that the random variables $\{X_i\}_{i=1}^{\infty}$ form a stationary process, the k-th order entropy is defined as follows:

$$H_k = \frac{1}{k} \sum_{w \in \Sigma^{k+1}} p(w) \log \frac{1}{p(w)}$$

where w is a sequence of k + 1 symbols drawn from Σ and p(w) is the stationary probability of emitting the sequence w. Put differently, the k-th order entropy is the entropy of groups of k + 1 symbols seen as members of a new alphabet Σ′ = Σ^{k+1}. Due to this characterization, the entropy H is sometimes also called the 0-th order entropy H_0. In case the random process is an ergodic Markov chain, the k-th order entropy can be expressed as follows:

$$H_k = \sum_{w \in \Sigma^k} p(w) \sum_{\sigma \in \Sigma} p(\sigma \mid w) \log \frac{1}{p(\sigma \mid w)}$$

Entropy of order k gives a lower bound for encoders that only look at the previous k symbols. Entropy has been used to prove many optimality results for integer coders (Section 1.2.2) and dictionary compressors (Section 1.2.3). However, since entropy is defined as a statistical measure of an idealized source, these results hold only in a probabilistic sense and thus do not generally hold for every possible input text; moreover, since the source entropy is often unknown, it cannot be directly used by the encoder in many practical settings. Because of this, the notion of empirical entropy has been introduced.

The empirical entropy is defined for strings instead of (idealized) sources. Let us fix a string S of length n, and let n_σ be the number of occurrences of σ in S. The 0-th order empirical entropy of S is defined as follows:

$$H_0(S) = \sum_{\sigma \in \Sigma} \frac{n_\sigma}{n} \log \frac{n}{n_\sigma}$$

The 0-th order empirical entropy is thus defined similarly to the 0-th order entropy, but replacing probabilities with frequencies. For any symbolwise encoder, i.e., an encoder that compresses each symbol individually without considering its surrounding symbols, the value nH_0(S) is a lower bound on the compressed size of S [Man01].

For encoders that may encode a symbol by first inspecting a context of the k previously occurring symbols, a lower bound on compression is provided by the k-th order empirical entropy. Let S_w be the string obtained by concatenating the symbols that follow each occurrence of w in S; the k-th order empirical entropy is then defined as follows:

$$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} n_w H_0(S_w)$$

As with the 0-th order empirical entropy, the k-th order empirical entropy is defined as the k-th order entropy of an ergodic Markov source, but replacing probabilities with frequencies. In Section 1.2.3 we illustrate some optimality results for LZ77 in terms of the empirical entropy.
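The two empirical entropies above translate directly into code. The following is a small Python sketch computing H_0(S) and H_k(S) from the definitions (function names are ours); as in the definition, H_k weights the 0-th order entropy of each context string S_w by its length n_w.

```python
from collections import Counter
from math import log2

def h0(s):
    """0-th order empirical entropy H_0(S), in bits per symbol."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy H_k(S): the H_0 of each context string S_w,
    weighted by its length n_w and averaged over the text length n."""
    n = len(s)
    contexts = {}
    for i in range(n - k):
        w = s[i:i + k]                        # context of k preceding symbols
        contexts.setdefault(w, []).append(s[i + k])
    return sum(len(sw) * h0(sw) for sw in contexts.values()) / n

print(h0("mississippi"), hk("mississippi", 1))
```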

1.2.2 Coding

A code C is a map from the symbols of the alphabet Σ to codewords, i.e., finite binary strings. A code encodes a sequence of symbols S = σ_1 ... σ_n into a binary sequence C(S) = C(σ_1) ... C(σ_n) by replacing each symbol σ_i with its codeword C(σ_i).

Codes can be either fixed-length or variable-length. In a fixed-length code every codeword has the same length ℓ, whereas in a variable-length code two codewords may have different lengths. Fixed-length codes can only encode alphabets of at most 2^ℓ symbols, while variable-length codes can potentially encode alphabets of infinite size.

Every code must be uniquely decodable, that is, there must not exist two distinct sequences of symbols S_1 and S_2 such that C(S_1) = C(S_2). Any injective fixed-length code is trivially uniquely decodable; on the other hand, injectivity alone is not enough for a variable-length code to be uniquely decodable. Variable-length codes usually enforce this condition by satisfying the prefix-free property, which requires that no codeword is a prefix of another. Moreover, the prefix-free property implies that decoding can be performed efficiently in an on-line manner.

In general, encoding the same sequence of symbols with two different encoders yields different compression ratios. A code is said to be optimal if it yields the lowest possible compression ratio.

Let us consider again an i.i.d. source producing symbols of the alphabet Σ with probability density function p(σ). For such a source, the expected codeword length E(C) of a code C is the expected length of a codeword when encoding symbols emitted by the source:

$$E(C) = \sum_{\sigma \in \Sigma} p(\sigma)\,|C(\sigma)|.$$

The source coding theorem implies that the expected codeword length cannot be less than the entropy: H_0 ≤ E(C); if H_0 = E(C), then C is obviously optimal. It can be proved that the expected codeword length matches the entropy if and only if, for each symbol σ in the alphabet, the length of its encoding |C(σ)| is equal to its information content I(σ) = log(1/p(σ)). In general, no prefix-free code whose expected codeword length matches the entropy exists, since I(σ) might not be an integer. In this case, it can be proved that any optimal prefix-free code satisfies H ≤ E(C) < H + 1, that is, the maximum redundancy is less than one bit [CT06].

Entropy coding

The Huffman algorithm [Huf52] constructs an optimal prefix-free code for any probability distribution over Σ. A prefix-free code built by the Huffman algorithm is specified by a tree, sometimes referred to as a Huffman tree. Every Huffman tree satisfies the following properties: (i) each leaf corresponds to a distinct symbol in Σ; (ii) each edge is labeled with a bit; (iii) the codeword of symbol σ is given by the concatenation of the labels (bits) on the path from the root to σ. In order to decode a Huffman-coded string, either the decoder must know the Huffman tree in advance or the encoder must transmit it along with the encoded string. In the latter case, the whole encoding scheme (Huffman tree plus Huffman-encoded string) is only optimal asymptotically in the length of the encoded string.
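A minimal sketch of the Huffman construction follows, using a binary heap to repeatedly merge the two least frequent subtrees; representing internal nodes as pairs and using a counter to break ties are implementation choices of this example, not part of the algorithm's specification.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Builds a prefix-free code from a symbol -> frequency map by repeatedly
    merging the two least frequent subtrees (the Huffman algorithm)."""
    tiebreak = count()  # avoids comparing subtrees when frequencies are equal
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left subtree, right subtree)
            walk(node[0], prefix + "0")   # each edge is labeled with a bit
            walk(node[1], prefix + "1")
        else:                             # leaf: an original symbol
            codes[node] = prefix or "0"   # guard for a one-symbol alphabet

    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```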

Integer coding

Let us now fix the alphabet Σ to be the set of natural numbers. Coders for such alphabets are called integer coders. We now review some of the most important integer codes.

Minimal binary. A fixed-length code where integers in the range [0, 2^w − 1] are represented in w bits using their binary representation. For example, when w = 3, the integer 3 is represented as 011.

Unary. A variable-length prefix code where the integer n > 0 is represented as a sequence of n − 1 ones followed by a single 0; it is optimal (i.e., the mean codeword length matches the entropy) when p(i) = 1/2^i. For example, the integer 3 is represented as 110.

Truncated binary encoding. A variable-length prefix code for encoding integers in a range R = [0, m] for some m > 0. This encoding is optimal when each integer in the range is emitted with equal probability. Let ℓ = ⌈log_2(m + 1)⌉. If |R| = m + 1 is a power of two, then every integer in R is simply represented in binary using ℓ bits per codeword. Otherwise, let u = 2^ℓ − (m + 1). In this case, the truncated binary encoding represents the first u integers with the smallest binary codewords of length ℓ − 1, and the remaining |R| − u integers with the greatest codewords of length ℓ. Decoding is simple: first, we read an integer i* of ℓ − 1 bits. If i* < u, then that is the encoded integer. Otherwise, we extend i* by reading one additional bit (obtaining an ℓ-bit value) and compute the encoded integer as the resulting value minus u.

Golomb [Gol66] and Rice [Ric79] coding. Golomb defined a family of codes, one for each value of a parameter k. The integer n is encoded as the pair (⌊n/k⌋, n mod k): the quotient ⌊n/k⌋ is encoded in unary, while the remainder n mod k is encoded in truncated binary. This code is optimal when the integers follow a geometric distribution with parameter k. Rice coding is simply the restriction of Golomb coding obtained by imposing that k be a power of two, so that n mod k can be represented in plain binary.
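As a concrete instance of the last scheme, here is a small Python sketch of Rice coding with parameter 2^k for non-negative integers, with codewords represented as bit strings for clarity; the unary convention used here for the quotient (q ones followed by a terminating zero) is the text's unary code shifted to start from 0, and all function names are ours.

```python
def rice_encode(n, k):
    """Rice code with parameter 2**k, for n >= 0: the quotient n >> k in unary
    (q ones followed by a terminating zero), then the remainder in k bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    remainder = format(r, "0{}b".format(k)) if k > 0 else ""
    return "1" * q + "0" + remainder

def rice_decode(bits, k):
    q = bits.index("0")                               # unary quotient
    r = int(bits[q + 1:q + 1 + k], 2) if k > 0 else 0
    return (q << k) | r

print(rice_encode(9, 2))          # '110' (quotient 2) + '01' (remainder 1)
assert rice_decode("11001", 2) == 9
```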

Universal codes. Let us now also assume that integers are generated by a source with monotonically non-increasing probabilities (so p(n) ≥ p(n + 1)). For such sources, an integer code c is said to be universal if there exists a constant λ such that, for any such probability distribution p, it holds that E_c[p] ≤ λ · H. Universal codes are useful because they give strong theoretical guarantees on the compression ratio even when the source model is unknown. This also implies that there is no need to transmit source statistics, which is an advantage over Huffman coding in these common settings. The most widely known universal codes are the Elias γ and δ codes [Eli75]. Both represent an integer n > 0 as a pair (⌊log_2 n⌋, n): in the Elias γ code, the value ⌊log_2 n⌋ + 1 is represented in unary, while n itself is represented using ⌊log_2 n⌋ bits by taking its binary representation and stripping the initial 1. The Elias δ code is similar, with the only difference that the value ⌊log_2 n⌋ + 1 is encoded using the γ code instead of unary.
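To make the γ code concrete, here is a small Python sketch using the common equivalent formulation in which ⌊log_2 n⌋ zeros are emitted before the full binary representation of n (the leading 1 of the binary part doubles as the unary terminator); the encode/decode names are ours.

```python
def elias_gamma_encode(n):
    """Elias gamma code for n >= 1: floor(log2 n) zeros, then the binary
    representation of n (whose leading 1 acts as the unary terminator)."""
    b = bin(n)[2:]                     # binary representation, leading 1 included
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    z = bits.index("1")                # number of leading zeros = floor(log2 n)
    return int(bits[z:2 * z + 1], 2)   # the next z + 1 bits are n in binary

for n in (1, 2, 3, 9):
    code = elias_gamma_encode(n)
    assert elias_gamma_decode(code) == n
    print(n, code)                     # 1 -> '1', 2 -> '010', 3 -> '011', 9 -> '0001001'
```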

Practical integer coding

While universal codes offer strong theoretical guarantees on the achievable compression ratio, many applications implement other encoding schemes that eschew these theoretical properties in exchange for faster decoding speeds on modern architectures.

Inefficiency in practical settings usually arises from the need to explicitly access individual bits by means of many bit-shift and logical and/or operations per codeword. There are many techniques that can speed up decoding, such as:

Bit-parallelism. Modern processors can apply some operations simultaneously to all bits in a word, which results in considerable reductions in execution time.

Table lookup. Decoding can be sped up by means of pre-computation and table look-up. For example, a table storing the rank of the lowest bit set to 1 in a byte can speed up the decoding of the Elias γ code.

Byte-alignment. Processors can only fetch memory at byte granularity; this means that, if a codeword starts in the middle of a byte and continues into the next bytes, the algorithm must first align the codeword by repeatedly applying bit-shift and and/or operators. An easy and effective way to avoid this inefficiency is to keep all codewords byte-aligned in the first place, by making codeword lengths multiples of one byte.

Branchless decoder. In pipelined, superscalar processors, a branch instruction (such as those used to implement if-then-else constructs in programming languages) can cause inefficiencies that are avoided in a branchless decoder design.

Here we report some well-known encoders that exploit these ideas.

Variable byte coding. A byte-aligned code where each byte is logically composed of two parts: an indicator bit and 7 data bits. The indicator bit of a byte is 1 if and only if it is the last byte of the codeword; the encoded integer is obtained by collating the data fields together.
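A minimal sketch of variable byte coding follows; placing the indicator bit in the most significant position and emitting the 7-bit groups least significant first are assumptions of this example, as the text does not fix a byte layout.

```python
def vbyte_encode(n):
    """Variable-byte code: 7 data bits per byte; the indicator (high) bit is
    set only on the last byte of the codeword. Assumes n >= 0."""
    out = []
    while n >= 128:
        out.append(n & 0x7F)       # data bits only, indicator bit = 0
        n >>= 7
    out.append(n | 0x80)           # last byte: indicator bit = 1
    return bytes(out)

def vbyte_decode(data):
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift   # collate the 7-bit data fields
        shift += 7
        if byte & 0x80:               # indicator bit set: codeword ends here
            break
    return n

assert vbyte_decode(vbyte_encode(300)) == 300
```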

PFor [ZHN+06]. This coding scheme encodes integers in the range [0, 2^w − 1], where w is the word size in bits. In this scheme, every integer is represented either as a coded integer, taking one byte, or as an exception, taking one word plus one byte. The encoding is conceptually divided into three zones: (i) a header; (ii) a code section; and (iii) an exception section. Every integer takes an entry in the code section. If the integer is less than 256, it is plainly encoded in the code section using one byte. Otherwise, the integer is represented in the exception section using one word, and its code section entry stores the number of non-exception integers before the next exception. Finally, the header contains a pointer to the first exception and the length of the encoding. Decoding can be performed by simply keeping three pointers to: (i) the next entry in the code section; (ii) the next entry in the exception section; and (iii) the next exception in the code section. This coding exploits byte-alignment and its decoder can be written in a branchless fashion, yielding very high decoding speeds.

The latest trend in the development of modern Lempel-Ziv compressors is to engineer an ad-hoc integer encoder and then devise a parsing strategy that heuristically tries to minimize the expected codeword length. In Chapter 4 we show that the optimal parsing strategy illustrated in Chapter 2 adapts itself to the underlying integer encoder, thus alleviating the need to come up with new integer coders. Another important line of research is instead devoted to developing techniques aimed at making entropy coding as fast as engineered non-entropic coders, such as the FSE encoding² used in Zstandard³, which is based on Asymmetric Numeral Systems [DTG+15].

² Reference implementation: https://github.com/Cyan4973/FiniteStateEntropy
³ Project homepage: www.zstd.net

1.2.3 Dictionary compression

In the context of dictionary compression, a dictionary is a map between codewords (i.e., binary strings) and strings drawn from Σ. Let us assume that the size of the alphabet is fixed, so |Σ| = O(1), and let us refer to the symbols of the alphabet as characters. A dictionary compressor replaces substrings of the text that are contained in the dictionary with their codewords. Compression is achieved when the codewords are shorter than the strings they replace. Dictionary compression methods are primarily distinguished by how the dictionary is defined and built. A dictionary can be either static, semi-static or dynamic. A static dictionary does not depend on the text and is already known to the decoder; think, for instance, of a fixed dictionary of frequent English words for compressing plain-text English documents. This method is simple, fast and can be effective when compressing uniform texts whose characteristics are known in advance. However, it is not generic: good compression ratios are achieved when frequent substrings are associated with short codewords, so it is not possible to define a single dictionary that achieves good compression for every possible text.

In a semi-static approach, a dictionary D_S is defined with respect to the input text S. The dictionary is constructed with the aim of minimizing the resulting compression ratio, for example by including common substrings and associating shorter codewords to substrings that occur more frequently in the text. However, since the decoder does not know the dictionary in advance, it must be transmitted alongside the compressed text. Transmitting the dictionary might be inefficient in space, and the task of constructing the dictionary that minimizes the total space occupancy is NP-Hard [SS82].

In dynamic dictionary methods, a sequence of dictionaries is defined, assuming that compression operates in a left-to-right fashion: when the (de)compressor has finished processing the prefix Pref_i(S), a dictionary D_i built over that prefix is defined; this dictionary is then used in the next steps to compress/decompress the suffix Suf_i(S). Dynamic methods do not require the dictionary to be transmitted (like static methods), and efficiently compress arbitrary texts (like semi-static methods). Because of this, general-purpose dictionary compressors are usually dynamic. Dynamic dictionary methods were introduced by Ziv and Lempel [ZL77] in 1977; the method introduced in their paper is usually referred to as LZ77. Here we illustrate a variant introduced by Storer and Szymanski [SS82]. In LZ77, the dictionary D_i consists of two classes of strings:

• copies, denoted by pairs ⟨d, ℓ⟩, that can be constructed by appending at position i a sequence of ℓ characters starting from S[i − d];

• characters: a single character c, denoted by ⟨0, c⟩.

For example, if the current prefix is abcd, then ⟨3, 2⟩ references the string bc (copying 2 characters starting from position 5 − 3 = 2). Notice that a phrase with d < ℓ induces a form of run-length encoding, since characters must be copied one at a time: for example, the copy ⟨2, 4⟩ represents the string cdcd (copying 4 characters starting from position 5 − 2 = 3). In LZ77, compression happens in two steps, which might be executed in parallel:

• parsing, where the input text S is partitioned into phrases S = p_1 p_2 ... p_m such that each phrase p_i belongs to the LZ77 dictionary;

• coding, where each phrase is replaced with its codeword and compressed using an integer encoder (a sketch of how a decoder inverts this process is given below).
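The following sketch shows how a decoder expands a sequence of phrases, copying one character at a time so that overlapping copies (d < ℓ) work exactly as in the cdcd example above; the representation of phrases as Python tuples is an assumption of this example.

```python
def lz77_decode(phrases):
    """Expands a sequence of LZ77 phrases: (0, c) appends the literal character
    c, while (d, l) with d > 0 copies l characters starting d positions back.
    Copying one character at a time makes overlapping copies (d < l) behave as
    run-length encoding."""
    out = []
    for d, x in phrases:
        if d == 0:
            out.append(x)                   # literal character phrase <0, c>
        else:
            start = len(out) - d
            for i in range(x):              # here x is the copy length l
                out.append(out[start + i])  # may re-read characters just written
    return "".join(out)

# 'abcd' followed by <2, 4>: the overlapping copy produces 'cdcd'.
print(lz77_decode([(0, "a"), (0, "b"), (0, "c"), (0, "d"), (2, 4)]))  # abcdcdcd
```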

Parsing

In general, there is more than one way to parse an input text S into LZ77 phrases. For example, the string "banana" can be parsed into b, a, n, a, n, a or into b, a, n, ana. The choice of the parsing to encode has a strong impact on the compression and decompression performance of an LZ77 compressor, as illustrated in Chapters 2 and 3.

[Figure 1.3: A depiction of T(D) with D = {algorithm, amber, dynamic, dynamo, zeta}.]

The most common approach (and the one suggested in the original paper by Ziv and Lempel) is to adopt a variation of the Greedy strategy. At each step, the greedy strategy selects the longest phrase in the dictionary: when processing Suf_i(S), it selects as the next factor the longest prefix of Suf_i(S) that also occurs as a prefix of a previous suffix Suf_j(S), with j < i. A common variation of the greedy strategy, commonly known as sliding window, limits the distance i − j to at most some parameter W, called the window. The parameter W thus controls a trade-off between compression ratio and compression efficiency (compression time and amount of memory used).

Perhaps the most obvious implementation of the LZ77 Greedy parsing is to naively compare the current suffix Suf_i(S) with each previously occurring suffix Suf_j(S), j < i. In this way, finding each factor of length ℓ_i takes O(n·ℓ_i) time, as there are O(n) suffixes to compare and each comparison takes O(ℓ_i) time. Overall, this naive implementation requires $O(n \sum_i \ell_i) = O(n^2)$ time (a sketch of such a parser is given below). However, this can be accelerated to O(n) time by resorting to string indexes like the Suffix Tree and the Suffix Array. We quickly review both of these index data structures in the next section. We first describe the trie and its compacted variant, which will serve as a basis for the Suffix Tree (the first data structure that allowed O(n)-time Greedy parsing) and the Suffix Array.
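For concreteness, here is the naive O(n²)-time greedy parser just described, written in Python; it emits copies as (d, ℓ) tuples and literals as (0, c) tuples, following the phrase notation of Section 1.2.3.

```python
def greedy_parse_naive(s):
    """O(n^2)-time greedy LZ77 parsing: at position i, compare Suf_i(S) with
    every previous suffix Suf_j(S) and take the longest match as the next
    phrase. Emits (d, l) copies and (0, c) literals."""
    phrases, i, n = [], 0, len(s)
    while i < n:
        best_len, best_j = 0, -1
        for j in range(i):                       # every previously occurring suffix
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1                           # matches may extend past i (overlap)
            if l > best_len:
                best_len, best_j = l, j
        if best_len > 0:
            phrases.append((i - best_j, best_len))
            i += best_len
        else:
            phrases.append((0, s[i]))            # no previous occurrence: literal
            i += 1
    return phrases

print(greedy_parse_naive("banana"))  # [(0, 'b'), (0, 'a'), (0, 'n'), (2, 3)]
```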

Data structures in dictionary compression

Trie. Let D be a set of strings such that no string in D is a prefix of another. A trie T built over D is a data structure that efficiently supports the following query: given a pattern P, find all elements of D that are prefixed by P. A trie answers such queries in optimal O(|P| + occ) time, where occ is the number of results. A trie is a multiway tree, recursively defined as follows:

• the root has a child labeled with σ if there is at least one string in D prefixed by σ;

• the child of the root labeled with σ is a trie storing the set D′ obtained by first selecting all strings of D prefixed by σ and then stripping σ from their prefix, that is, D′ = {s | σs ∈ D}.

An example of a trie is depicted in Figure 1.3. Let label(u) be the concatenation of the labels on the path from the root to u in the trie. Then, for each string s ∈ D there is a leaf u ∈ T such that label(u) = s. This property implies that membership queries for a pattern P, that is, checking whether P ∈ D, can be performed in optimal O(|P|) time by simply navigating the trie.

[Figure 1.4: The result of compacting the trie depicted in Figure 1.3.]

A trie T can have up to O(N) nodes, where $N = \sum_{s \in D} |s|$ is the sum of the lengths of all strings in D. A compacted trie reduces the number of nodes to O(n), where n is the number of strings in D, through a technique named path compaction. A path π = n_1, ..., n_m in T is a maximal non-branching path if the following conditions hold: (i) nodes n_1, ..., n_{m−1} are non-branching, that is, they have exactly one child; (ii) either n_m is a leaf or it is branching; (iii) either n_1 is the root of T or its parent is branching. In other words, a maximal non-branching path is a path of non-branching nodes in T that cannot be extended at either end. Path compaction replaces every maximal non-branching path π with a single edge whose label is the concatenation of the labels along π. A compacted trie is defined simply as a trie whose maximal non-branching paths have been path-compacted. Figure 1.4 shows the result of this compaction process when applied to the trie depicted in Figure 1.3. Since every node of a compacted trie (except possibly the root) is branching (by definition of path compaction), a compacted trie built over a dictionary D with n keys has O(n) nodes.

Lemma 1. A compacted trie built over D has O(|D|) nodes and supports prefix searches for patterns of length ℓ in optimal O(ℓ) time.

Suffix Tree. A Suffix Tree built over S, first described by Weiner [Wei73], is a compacted trie whose dictionary is the set of suffixes of S. Since any substring of S is a prefix of some suffix of S, the following lemma directly follows from Lemma 1.

Lemma 2. A Suffix Tree built over S has O(|S|) nodes and supports substring searches for patterns of length ℓ in optimal O(ℓ) time.

A suffix tree can be built in O(n) time [FFM00] and, even though the concatenation of all its labels requires O(n²) bits when represented naively, it can be represented in O(n log n) bits by storing each edge label as a pair of pointers to the start and the end of the corresponding substring of S. The suffix tree ST(S) built over S allows for powerful and efficient pattern matching queries on S, since a prefix search on ST(S) returns all suffixes whose prefix matches the input pattern. As illustrated in the next section, this operation allows for an elegant and time-efficient Greedy algorithm.

[Figure 1.5: A suffix tree built over S = banana$.]

Suffix Array. A problem with suffix trees is their space occupancy: even though a suffix tree takes O(n) words of space, the constant hidden in the Big-Oh notation is quite large, as careful implementations can take up to 20 bytes per symbol [Kur99]. Suffix arrays, introduced by Manber and Myers [MM93], provide a more succinct alternative to suffix trees. Let us consider the list L of all suffixes of S sorted in lexicographic order; the suffix array Sa of S is the array of n integers such that Sa[i] is the starting position of the i-th suffix in L, that is, S[Sa[i], n] is the i-th suffix of S in lexicographic order.

A suffix array can easily be built in O(n) time from a suffix tree through an in-order visit. However, it can also be built directly within the same time complexity [KA05; KSB06a; KSP+05] using less working memory.

By itself, the Sa supports prefix searches in O(|P| log n) time through binary search. However, when augmented with a proper combination of supporting data structures, this can be improved to O(|P| + log n) [MM93]. There are many full-text indexes that extend the Suffix Array or related data structures (such as the Burrows-Wheeler Transform) to provide efficient search operations on S; the interested reader can consult the survey on compressed full-text indexes by Navarro and Mäkinen [NM07] and the references therein. Discussing such techniques in depth is out of scope for this section, as parsing algorithms do not search for arbitrary patterns in S but rather look for a previous suffix with maximum Longest Common Prefix (lcp) with respect to the current suffix. The longest common prefix of two strings s_1 and s_2 is the longest string p such that p = s_1[1, |p|] = s_2[1, |p|]. This operation can be sped up by providing an index that answers lcp queries between two arbitrary suffixes quickly. Let lcp(i, j) be the length of the longest common prefix between Suf_S(i) and Suf_S(j), and let the lcp array store the lcp of every pair of lexicographically consecutive suffixes; then lcp(i, j) equals the minimum of the lcp entries between indexes Isa[i] and Isa[j], where Isa is the inverse suffix array, which maps the starting position of a suffix to its lexicographic rank. Asking for the minimum value in a range is usually referred to as performing a range minimum query. An RMQ data structure takes O(n) bits and answers these range minimum queries in O(1) time. Thus, an lcp array, together with an RMQ structure built over it and the Isa array, answers lcp queries in O(1) time [FH06]. Figure 1.6 shows an example of a Sa augmented with an lcp array. In the example, the lcp between Suf_S(6) = a and Suf_S(2) = anana is lcp(2, 6) = min{1, 3} = 1.

[Figure 1.6: Suffix array and lcp array of S = banana.]
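The following sketch builds the suffix array and lcp array of "banana" naively and answers an lcp query as a minimum over a range of lcp entries, mirroring the example of Figure 1.6; 0-indexed positions are used instead of the 1-indexed ones of the text, and the construction here is naive, not the linear-time algorithms cited above.

```python
def suffix_array(s):
    """Naive construction: sort suffix starting positions lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = longest common prefix length of the (i-1)-th and i-th suffixes
    in lexicographic order."""
    def lcp(a, b):
        l = 0
        while a + l < len(s) and b + l < len(s) and s[a + l] == s[b + l]:
            l += 1
        return l
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

s = "banana"
sa = suffix_array(s)                 # [5, 3, 1, 0, 4, 2]
lcp = lcp_array(s, sa)               # [0, 1, 3, 0, 0, 2]
isa = [0] * len(s)
for rank, pos in enumerate(sa):
    isa[pos] = rank                  # inverse suffix array

# lcp between the suffixes starting at 1 ('anana') and 5 ('a'):
i, j = sorted((isa[1], isa[5]))
print(min(lcp[i + 1:j + 1]))         # range minimum over lcp entries -> 1
```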

Parsing algorithms

The first algorithms that compute the Greedy factorization in linear time were based on suffix trees, while subsequent algorithms are based on suffix arrays due to the lower working space needed. We now briefly review the two approaches.

Suffix Tree algorithm. Linear-time greedy strategies based on suffix trees are conceptually simple; the first such algorithm was described by Rodeh, Pratt, and Even [RPE81] in 1981. The main idea is to instrument the suffix tree of S so as to add, to each internal node v, a link leftmost(v) pointing to the node with the leftmost associated suffix among all nodes in the subtree of v. In other words, for each node v in the Suffix Tree, leftmost(v) points to the leftmost occurrence of label(v) in S. Let us now assume the Greedy parsing must find a factor starting at position i. The algorithm traverses the Suffix Tree starting at the root, searching for the suffix Suf_i(S). The traversal continues until it hits a node v′ such that leftmost(v′) is associated with Suf_i(S) itself. Let Suf_j(S) be the suffix associated, via leftmost, with the parent node v″ of v′ in the Suffix Tree. It can be easily verified that (i) there is no occurrence of label(v′) in S starting before i (because leftmost(v′) = Suf_i(S)); (ii) label(v″) does occur before i in S (because otherwise the search would have stopped at that node); and (iii) Suf_j(S) is a suffix with the longest common prefix with Suf_i(S) among those starting at earlier positions in S. This algorithm takes O(n) time and O(n) words of space.

Suffix array. The first approach using suffix arrays to compute the LZ77 parsing was illustrated by Abouelhoda, Kurtz, and Ohlebusch [AKO04] as an application of their "enhanced suffix array" data structure. Their approach emulates a bottom-up visit of the suffix tree of the input text, "enhanced" with the support of an additional tree data structure built over LCP values. Subsequent works [CI08; CIS08; CPS08; GB13; KKP13; KP13b; OG11] focused on simpler approaches that compute the LZ77 factorization directly on suffix arrays.

Let us illustrate a simple observation introduced by Crochemore and Ilie [CI08]. Since suffixes are sorted in lexicographic order, suffixes closer to Sa[i] in the Sa have longer lcps than suffixes that are farther apart. This effectively turns the problem into that of efficiently computing batch psv/nsv queries on Sa:

Definition 3. For an array A, let $psv_A(i) = \max_{j \in [1, i-1]} \{\, j \mid A[j] < A[i] \,\}$ and $nsv_A(i) = \min_{j \in [i+1, n]} \{\, j \mid A[j] < A[i] \,\}$.

So psv_A(i) is the position of the previous smaller value with respect to i in A, and nsv_A(i) is that of the next smaller value. From the observation above, the following lemma easily follows:

Lemma 4. Let Suf_S(−∞) be the empty string. Then, for any position i, either Suf_S(psv_Sa(i)) or Suf_S(nsv_Sa(i)) is a suffix with maximum longest common prefix among those occurring before Suf_S(i).

Almost every O(n)-time approach that computes the LZ77 factorization directly on suffix arrays exploits Lemma 4. For example, the algorithm of Goto and Bannai [GB13] first uses a stack to identify the psv and nsv values of each position and then computes the lengths of the corresponding lcps; a sketch of a stack-based psv/nsv computation is given below. For more details about the state of the art of parsing algorithms, Al-Hafeedh et al. [ACI+12] provide a comprehensive survey on the topic. In Chapter 2 we introduce a parsing strategy inspired by the ideas illustrated in this section.
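A single left-to-right scan with a stack suffices to compute all psv/nsv values, which is the flavor of stack usage mentioned for the algorithm of Goto and Bannai (the sketch below is a generic illustration, not their exact algorithm); it assumes distinct values, as is the case when A is a suffix array, i.e., a permutation of positions.

```python
def psv_nsv(a):
    """Computes, for every position i, the previous and next positions holding
    a value smaller than a[i] (Definition 3, 0-indexed), with one left-to-right
    stack scan; positions with no such value get -1 / len(a). Assumes distinct
    values, as in a suffix array (a permutation)."""
    n = len(a)
    psv, nsv = [-1] * n, [n] * n
    stack = []                       # indices whose values are strictly increasing
    for i in range(n):
        while stack and a[stack[-1]] >= a[i]:
            nsv[stack.pop()] = i     # a[i] is the first smaller value to the right
        if stack:
            psv[i] = stack[-1]       # nearest smaller value to the left
        stack.append(i)
    return psv, nsv

print(psv_nsv([3, 1, 4, 0, 5]))      # ([-1, -1, 1, -1, 3], [1, 3, 3, 5, 5])
```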

Optimality results

A number of optimality results for LZ77 have been proved in the last fifty years. Ziv and Lempel [ZL77] proved optimality for certain deterministic families of sources; subsequent research focused on studying the performance of LZ77 when encoding the output of an idealized Markovian source. Among these results:

• Wyner and Ziv [WZ94] prove that, when using the greedy strategy, the Elias γ-code for encoding lengths and minimal binary for distances, the compression ratio converges to the source's entropy;

• Savari [Sav98] shows that the redundancy of LZ77, that is, the difference between the average number of bits per symbol and the source entropy, shrinks as $O\left(\frac{\log \log n}{\log n}\right)$ on a unifilar Markovian source, using the greedy strategy and the Elias δ-code for encoding lengths and minimal binary for distances.

However, these results hold only in the average case and for these idealized sources. Kosaraju and Manzini [KM99] tackle these issues by analyzing the performance of LZ77 against the empirical entropy, showing that the average number of bits per symbol of the LZ77 parsing of S converges to its k-th order empirical entropy, for all k, when the length of S goes to infinity, a result known as coarse optimality. However, this only holds asymptotically and, on low-entropy strings, the gap between the average number of bits per symbol and the entropy can remain bounded away from zero even asymptotically.

So there is still room for improving compression ratios. In fact, even though the Greedy strategy minimizes the number of phrases, thanks to the suffix-complete property of the LZ77 dictionary, it does not necessarily minimize the compression ratio when variable-length encoders are used. Consider, for example, the rightmost and leftmost Greedy variants. These two variants specify, for any phrase p = S[i, i + ℓ] in the Greedy parsing of an input text S, which of the occurrences of p in S must be referenced by the copy ⟨d, ℓ⟩ associated with p. The leftmost variant references the first (or leftmost) occurrence of p in S, while the rightmost variant references the last (or rightmost) occurrence of p in S that starts before i. When universal codes are used for compressing a parsing, the rightmost strategy yields better compression ratios because it selects copies with smaller distances, which take fewer bits under universal codes. This fact contributed to the development of efficient rightmost parsers [BP16], even though these solutions are more complicated and sophisticated than leftmost parsers (such as the Suffix-Tree-based one illustrated earlier in this section).

In fact, Ferragina, Nitto, and Venturini [FNV13] introduced the bit-optimal LZ77 parsing problem, that is, the problem of constructing the most succinct LZ77 parsing when universal codes are used to compress the parsing. In their work, they show that the ratio between the compression ratios of the greedy and the bit-optimal parsings is Ω(log n / log log n) for an infinite family of input strings. Notice that this does not contradict the coarse optimality results just illustrated. They also present an algorithm for constructing the bit-optimal LZ77 parsing in O(n log n) time. However, their algorithm cannot be applied when using entropy coders, which are more compact than universal codes.


2 On Space-Optimal Compression

In this chapter we discuss the bit-optimal LZ77 parsing problem (BLPP), defined as follows: given a text S and two integer encoders enc_dist and enc_len, determine the minimal-space LZ77 parsing of S under the condition that every phrase is encoded with enc_dist (for the distance component) and enc_len (for the length component), where both encoders satisfy the non-decreasing cost property:

Property 5. An integer encoder enc satisfies the non-decreasing cost property if |enc(n)| ≤ |enc(n′)| for all positive integers n ≤ n′.

Let s(d, ℓ) denote the length in bits of the encoding of ⟨d, ℓ⟩, that is, s(d, ℓ) = |enc_dist(d)| + |enc_len(ℓ)|. We assume that a phrase consisting of one single character is represented by a fixed amount of bits. Let us also denote with s_costs the number of distinct values assumed by s(d, ℓ) when d, ℓ ≤ n.
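For instance, if both components are γ-coded (an illustrative choice, using the gamma_len function sketched in Chapter 1), s(d, ℓ) can be computed as follows:

    def s(d, l):
        # bit cost of the phrase <d, l> when both components are gamma-coded
        return gamma_len(d) + gamma_len(l)

    # gamma has one cost class per codeword length: for n = 2**20 both
    # Q_dist and Q_len equal 21, hence s_costs = Q_dist + Q_len = 42 here.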

Ferragina, Nitto, and Venturini [FNV13] have shown an (asymptotically) efficient algorithm for this problem, proving the following theorem.

Theorem 6 ([FNV13, Theorem 5.4]). Given a string S and two integer-encoding functions enc_dist and enc_len that satisfy Property 5, there exists an algorithm to compute the bit-optimal LZ77 parsing of S in O(n · s_costs) time and O(n) words of space.

The algorithm illustrated in Theorem 6, albeit asymptotically optimal in the RAM model, is not efficient in practice due to its expensive pattern of memory accesses. In Section 2.2 we illustrate a novel BLPP algorithm that is significantly faster in practice while still achieving the same asymptotic time/space complexities stated in the theorem above. This result has been obtained thanks to the use of simpler data structures and cache-friendly memory access patterns (see Section 2.2). It represents an important step in closing the compression-time gap between the bit-optimal LZ77 strategy and widely used compressors such as Gzip^1, Bzip2^2 and Lzma2^3. Our new algorithm relies on the same optimization framework devised by Ferragina, Nitto, and Venturini [FNV13]. For the sake of clarity, we illustrate that framework in Section 2.1, restating some lemmas and theorems proved in that paper.

^1 Homepage: http://www.gzip.org/
^2 Homepage: http://www.bzip.org/


Shortly, the authors show how to reduce BLPP to a shortest path problem on a DAG G with one node per character and one edge for every LZ77 phrase. However, even though the shortest path problem on DAGs can be solved in time linear in the size of the graph, a direct resolution of BLPP would be inefficient because G might consist of O(n²) edges. The authors solve this problem in two steps: first, G is pruned to a graph G̃ with just O(n · s_costs) edges, still retaining at least one optimal solution; second, the graph G̃ is generated on-the-fly in constant amortized time per edge, so that the shortest path can be computed in the time and space bounds stated in Theorem 6. We notice that the codewords generated by the most interesting universal integer encoders have logarithmic length (i.e., s_costs = O(log n)), so the algorithm of Theorem 6 solves BLPP in O(n log n) time. However, this algorithm requires the construction of suffix arrays and compact tries of several substrings of S (possibly transformed in proper ways), which are then accessed via memory patterns that make the algorithm pretty inefficient in practice.

In Section 2.2 we show our new algorithm for the second step above, which is where our novel solution differs from the known one [FNV13]. As previously mentioned, our algorithm only performs scans over a few lists, so it is much simpler and therefore much more efficient in practice. A comprehensive set of experiments validating our approach is given in Chapter 4.

2.1 A review of the Bit-Optimal LZ77 parsing strategy

Let us assume that LZ77 phrases ⟨d, ℓ⟩ are compressed by using two distinct, stateless integer encoders enc_dist and enc_len that satisfy the non-decreasing cost property. These two assumptions are not overly restrictive because they encompass all known universal encoders, such as Truncated binary, Elias' Gamma and Delta [Eli75; Gol66], and Lz4's^4 encoder^5. In the following we denote with s(P) = Σ_{⟨d,ℓ⟩ ∈ P} s(d, ℓ) the length in bits of the compressed output generated according to the LZ77 parsing P. Moreover, let Q_dist be the number of times the value |enc_dist(d)| changes when d = 1, . . . , n, and let Q_len be defined in a similar way. Under the assumption that both enc_dist and enc_len satisfy Property 5, it holds that s_costs = Q_dist + Q_len. Most interesting integer encoders, such as the ones listed above, encode integers in a number of bits proportional to their logarithm and, thus, both Q_dist and Q_len are O(log n). A summary of the notation used throughout this section is given in Table 2.1.

2.1.1 A graph-based approach

Given an input string S of length n, BLPP is reduced to a shortest path problem over a weighted DAG G that consists of n + 1 nodes, one per input character plus a sink node, and m edges, one per possible LZ77 phrase in S.

^4 Homepage: https://github.com/lz4/lz4
^5 We notice that other well-known integer encoders that do not satisfy the non-decreasing cost property, such as PForDelta and Simple9, are indeed not universal and, crucially, are designed to be effective only in some specific situations, such as the encoding of inverted lists. Since these settings make very specific assumptions about symbol distributions that are generally not satisfied by LZ77 parsings, they are not examined in this work.


Table 2.1: Summary of the main notation used in this chapter.

Name        Definition
S           A (null-terminated) document to be compressed.
n           Length of S (end-of-text character included).
⟨d, ℓ⟩      An LZ77 phrase of length ℓ that can be copied at distance d.
⟨0, c⟩      An LZ77 phrase that represents the single character c.
s(d, ℓ)     The length in bits of the encoding of ⟨d, ℓ⟩.
s(π)        The space occupancy of parsing π.
s_costs     The number of distinct values that may be assumed by s(d, ℓ) when d ≤ n, ℓ ≤ n.
Q           Number of cost classes for the integer encoder enc (may depend on n).
L(k)        Codeword length (in bits) of the integers belonging to the kth cost class of enc.
M(k)        Largest integer encodable by the kth cost class of enc.
E(k)        Number of integers encodable within the kth cost class (i.e., E(k) = M(k) − M(k − 1)).
W(k, j)     The jth block of E(k) contiguous nodes in the graph G.
Sa[1, n]    The suffix array of S; Sa[i] is the ith lexicographically smallest suffix of S.
Isa[1, n]   The inverse suffix array: Isa[j] is the (lexicographic) position of S[j, n] among all text suffixes, hence in Sa (i.e., j = Sa[Isa[j]]).

In particular there are two different types of edges to consider:

(i) edge (i, i + 1), which represents the case of the single-character phrase S[i], and

(ii) edge (i, j), with j = i + ℓ > i + 1, which represents the substring S[i, i + ℓ − 1] that also occurs previously in S.

This construction is similar to the one originally proposed by Schuegraf and Heaps [SH74]. By definition, every path π in this graph is mapped to a distinct parsing P, and vice versa. An edge (i′, j′) is said to be nested into an edge (i, j) if i ≤ i′ < j′ ≤ j. A peculiar property of G, extensively exploited throughout this and the next chapter, is that its set of edges is closed under nesting:

Property 7. Given an edge (i, j) of G, any edge nested in (i, j) is an edge of G.

This property follows quite easily from the prefix/suffix-complete property of the LZ77 dictionary and from the bijection between LZ77 phrases and edges of G. Since each edge (i, j) of G is associated with an LZ77 phrase, it can be weighted with the length, in bits, of its codeword. In particular:

(i) edge (i, i + 1) is weighted with the fixed amount of bits used to represent the single-character phrase ⟨0, S[i]⟩;

(ii) any other edge (i, j) with j > i + 1, representing a copy of length ℓ = j − i from some position i − d, is instead weighted with the value s_G(i, j) = s(d, ℓ) = |enc_dist(d)| + |enc_len(ℓ)|, where the notation ⟨d, ℓ⟩ is used to denote the associated codeword.
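As a sanity check of this construction, the following quadratic-time sketch (our own, purely illustrative and far from efficient) enumerates all copy-edges of G for a short string; it exploits the prefix-complete property of the LZ77 dictionary to stop extending a copy as soon as it no longer occurs earlier in the text.

    def edges_of_G(S):
        # (i, j) is a copy-edge iff S[i, j-1] occurs in S starting before i.
        # Nodes are 1-based as in the text; for illustration only.
        n = len(S)
        E = []
        for i in range(n):
            l = 1
            # prefix-completeness: once a length fails, longer ones fail too
            while i + l <= n and S.find(S[i:i+l]) < i:
                E.append((i + 1, i + l + 1))
                l += 1
        return E

    # len(edges_of_G("a" * 8)) == 28: S = a^n yields Theta(n^2) copy-edges.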

As it currently stands, the definition of G is ambiguous whenever there exists a substring S[i, j − 1] occurring more than once previously in the text: in this case, phrase S[i, j − 1] references more than one string in the LZ77 dictionary. We fix this ambiguity by selecting one phrase ⟨d, ℓ⟩ among those that minimize the expression s(d, ℓ). Hence, the space cost s_G(π) of any 1n-path π (from node 1 to node n + 1), defined as the sum of the costs of its edges, is the compressed size s(P) in bits of the associated parsing P. Computing the bit-optimal parsing of S is thus equivalent to finding a shortest path on the graph G. In the following we drop the subscript from s_G whenever this does not introduce ambiguities. Since G is a DAG, computing the shortest 1n-path takes O(m + n) time and space; unfortunately, there are strings for which m = Θ(n²) (e.g., S = a^n generates one edge per substring of S). This means that a naive implementation of the shortest-path algorithm would not be practical even for files of a few tens of MiBs. Two ideas have been deployed to solve this problem [FNV13]:

1. reduce G to an asymptotically smaller subgraph with only O(n · s_costs) edges in which at least one bit-optimal path is preserved;

2. generate on-the-fly the edges of the pruned graph by means of an algorithm, called FSG, that takes O(1) amortized time per edge and optimal O(n) words of space. This is where our novel solution, illustrated in Section 2.2, differs (see the sketch after this list for how these edges feed the shortest-path computation).
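For completeness, once the (pruned) forward star of each node can be enumerated, the shortest 1n-path is found by a single left-to-right relaxation, since all edges of G point rightwards. Below is our own sketch; forward_star(i) is a placeholder for any enumerator of the (maximal) edges leaving node i, each returned as a pair (endpoint, bit cost).

    import math

    def shortest_parsing(n, forward_star):
        # Nodes are 1 .. n+1; forward_star(i) yields pairs (j, cost).
        dist = [math.inf] * (n + 2)
        parent = [0] * (n + 2)
        dist[1] = 0
        for i in range(1, n + 1):          # topological (left-to-right) order
            for j, cost in forward_star(i):
                if dist[i] + cost < dist[j]:
                    dist[j] = dist[i] + cost
                    parent[j] = i
        path, j = [], n + 1                # backtrack the optimal parsing
        while j != 1:
            path.append((parent[j], j))
            j = parent[j]
        return dist[n + 1], path[::-1]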

Pruning the graph

We now illustrate how to construct a reduced subgraph G̃ with only O(n · s_costs) edges that retains at least one optimal 1n-path of G. It is easy to show that, if both enc_dist and enc_len satisfy the non-decreasing cost property, then the edges outgoing from a node have non-decreasing space costs when listed by increasing endpoint, from left to right.

Lemma 8. If both enc_dist and enc_len satisfy Property 5 then, for any node i and for any pair of nodes j < k such that (i, j), (i, k) ∈ G, it holds that s(i, j) ≤ s(i, k).

This monotonicity property of the space costs of G's edges leads to the notion of maximality of an edge.

Definition 9. An edge e = (i, j) is said to be maximal if and only if either the (next) edge e′ = (i, j + 1) does not exist, or it does exist but s(i, j) < s(i, j + 1).
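Operationally, Lemma 8 makes maximal edges easy to isolate: scanning the forward star of a node by increasing endpoint, one keeps only the longest edge of each distinct cost. The sketch below is our own illustration; cost(j) stands for s(i, j) and max_j for the longest copy starting at node i, both assumed to be given.

    def maximal_edges(i, max_j, cost):
        # cost(j) = s(i, j) is non-decreasing in j (Lemma 8), so the maximal
        # edges are the last endpoint of each run of equal-cost edges.
        endpoints = []
        for j in range(i + 2, max_j + 1):
            if endpoints and cost(endpoints[-1]) == cost(j):
                endpoints[-1] = j      # same cost, longer copy: extend
            else:
                endpoints.append(j)    # a new cost class begins at j
        return endpoints               # at most s_costs endpoints survive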

Recall that s_costs is the number of distinct values assumed by s(d, ℓ) for d, ℓ ≤ n. By definition, every node has at most s_costs maximal outgoing edges. The following lemma holds:


Lemma 10. For each triple of nodes i < i′ < j, and for each path π from i to j, there exists a path π′ from i′ to j such that s(π′) ≤ s(π).

The lemma, used with j = n, allows us to “push” non-maximal edges to the right by iteratively substituting non-maximal edges with maximal ones without increasing the space cost of the path. By exploiting this lemma, Theorem 11 shows that the search for shortest paths in G can be limited to paths composed of maximal edges only.

Theorem 11. For any 1n-path π there exists a 1n-path π* composed of maximal edges only and such that the space cost of π* is not worse than the space cost of π, i.e., s(π*) ≤ s(π).

Proof. We show that any path π containing non-maximal edges can be turned into a 1n-path π′ containing maximal edges only. Take the leftmost non-maximal edge in π, say (v, w), and denote by π_v (resp. π_w) the prefix (resp. suffix) of path π ending in v (resp. starting from w). By definition of maximality, there must exist a maximal edge (v, z), with z > w, such that s(v, z) = s(v, w). We can apply Lemma 10 to the triple (w, z, n) and thus derive a path µ from z to n such that s(µ) ≤ s(π_w). We then construct the 1n-path π″ by connecting the sub-path π_v, the maximal edge (v, z), and the path µ: one readily shows that the space cost of π″ is not larger than that of π. The key observation is that we pushed the leftmost non-maximal edge (if any) to the right, and it must now occur (if ever) within µ; by iterating this argument we get the thesis.

Hence, the pruned graph G̃ obtained by keeping only the maximal edges of G has O(n · s_costs) edges. Most interesting integer encoders, such as Truncated binary, Elias' Gamma and Delta [Eli75], Golomb [Gol66], and Lz4's encoder, encode integers in a number of bits proportional to their logarithm; thus s_costs = O(log n) in practical settings, so G̃ is asymptotically sparser than G, having only O(n log n) edges.

Theorem 12. The subgraph G̃, defined by retaining only the maximal edges of G, preserves at least one shortest 1n-path of G and has O(n log n) edges when using encoders with a logarithmically growing number of cost classes.

The on-the-fly generation idea

As discussed previously, the pruning strategy consists of retaining only the maximal edges of G, that is, every edge of maximum length among those of equal cost. We now describe the basic ideas underlying the generation of these maximal edges incrementally, along with the shortest-path computation, in O(1) amortized time per edge and only O(n) auxiliary space. We call this task the Forward Star Generation (shortly, FSG). The efficient generation of maximal edges requires characterizing them in a slightly different way. Let us assume, for simplicity's sake, that a single encoder enc is used to encode both distances and lengths (that is, enc = enc_dist = enc_len). If the function |enc(x)| changes only Q times when x ≤ n, then enc induces a partitioning of the range [1, n] (namely, the range of potential copy-distances and copy-lengths of LZ77 phrases) into Q contiguous subranges, the cost classes of Table 2.1: the kth class contains the E(k) = M(k) − M(k − 1) integers whose codewords are L(k) bits long.
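To make this partitioning concrete, the following sketch (ours, purely illustrative) computes the quantities Q, L(k) and M(k) of Table 2.1 for any codeword-length function; with gamma_len from the previous chapter and n = 1000 it yields Q = 10 classes.

    def cost_classes(len_fn, n):
        # Partition [1, n] into the cost classes of an encoder whose
        # codeword-length function len_fn is non-decreasing (Property 5).
        L, M = [], []                  # L[k]: class bit-length; M[k]: class maximum
        for x in range(1, n + 1):
            if not L or len_fn(x) != L[-1]:
                L.append(len_fn(x))    # a new cost class starts at x
                M.append(x)
            else:
                M[-1] = x              # extend the current class up to x
        return L, M                    # Q = len(L)

    # cost_classes(gamma_len, 1000) -> L = [1, 3, 5, ..., 19],
    #                                  M = [1, 3, 7, ..., 511, 1000]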
