Space and Time-Efficient Data Structures
for Massive Datasets
Giulio Ermanno Pibiri
giulio.pibiri@di.unipi.it
Supervisor: Rossano Venturini
Computer Science Department, University of Pisa
17/10/2016
Evidence
“Software is getting slower more rapidly than hardware becomes faster.”
Niklaus Wirth, A Plea for Lean Software
The increase of information does not scale with technology.
Even more relevant today!
Scenario
Data Compression, Data Structures, Algorithms.
PERFORMANCE: “how quickly a program does its work” (faster work): + space, - time.
EFFICIENCY: “how much work is required by a program” (less work): + time, - space.
Can we get both?
Small VS Fast?
Dichotomy problem: choose one?
NO.
High Level Thesis
“Space optimization is closely related to time optimization in a disk memory.”
Donald E. Knuth, The Art of Computer Programming
Data Structures + Data Compression = Faster Algorithms
Hierarchical Memory Organisation
CPU registers: 32-64 bits
L1: 32 KB (0.5 ns)
L2: 256 KB (7 ns)
RAM: 4-32 GB (100 ns)
DISK: 1 TB (1-10 ms)
Size grows and speed drops as we move away from the CPU.
Numbers are taken from: https://gist.github.com/jboner/2841832
Proposal
Design space-efficient ad-hoc data structures, both from a theoretical and practical perspective, that support fast data extraction.
Must exploit properties of the addressed problem.
Data compression & fast retrieval together.
Mature algorithmic solutions now ready for technology transfer.
Why? Who cares?
Because engineered data structures are the ones that boost the availability and wealth of information around us.
Why? Who cares?
Because engineered data structures are the ones the boost the availability
and wealth of information around us.
• Inverted Indexes
• N-grams
• B-trees
• …
Wait!
Can’t we use existing libraries?
C++ Standard Template Library (STL)?
std::unordered_map, std::list, std::map, std::stack, std::queue?
Prefer cache-friendly (contiguous) data structures. Always.
Use std::vector.
Prefer cache-friendly data structures.
Example: vector of strings VS string pool
Vector of pointers: every access is a cache miss. Pointer chasing :-(
Vector of offsets: offsets instead of pointers. Contiguous memory layout.
[Figure: the same strings laid out twice in memory: scattered blocks reached through a vector of pointers (~5.2 GB) VS one contiguous character pool addressed by a vector of offsets (~3 GB).]
Inverted Indexes
Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.
For each term t in T we store in a list Lt the identifiers of the documents in which t appears.
The collection of all inverted lists {Lt1,…,LtT} is the inverted index.
[Figure: an example collection of five documents, numbered 1-5, over the lexicon below.]
T = {always (t1), boy (t2), good (t3), house (t4), hungry (t5), is (t6), red (t7), the (t8)}
Lt1 = [1, 3]
Lt2 = [4, 5]
Lt3 = [1]
Lt4 = [2, 3]
Lt5 = [3, 5]
Lt6 = [1, 2, 3, 4, 5]
Lt7 = [1, 2, 4]
Lt8 = [2, 3, 5]
Inverted Indexes
Inverted indexes owe their popularity to the efficient resolution of queries such as: “return all documents in which terms {t1,…,tk} occur”.
Example: q = {boy, is, the}. Intersecting Lt2 = [4, 5], Lt6 = [1, 2, 3, 4, 5] and Lt8 = [2, 3, 5] returns document 5.
posting lists intersection
General Problem
Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i] ≤ S[i+1] for 0 ≤ i < n-1, possibly repeated.
How to represent it as a bit vector in which each original integer is self-delimited, using as few bits as possible?
A huge body of research describes different space/time trade-offs:
• Elias gamma/delta [Elias, TIT 1975]
• Variable Byte [Salomon, Springer 2007]
• Binary Interpolative Coding [Moffat and Stuiver, IRJ 2000]
• Simple-9 [Anh and Moffat, IRJ 2005]
• PForDelta [Zukowski et al., ICDE 2006]
• OptPFD [Yan et al., WWW 2009]
• Simple-16 [Anh and Moffat, SPE 2010]
• Varint-G8IU [Stepanov et al., CIKM 2011]
• Elias-Fano [Vigna, WSDM 2013]
• Partitioned Elias-Fano [Ottaviano and Venturini, SIGIR 2014]
For Firefly, Dropbox’s full-text search engine, speed has always been a priority.
They were unable to scale because of the size of their (distributed) inverted index.
Consequence? Query latencies deteriorated from 250ms to 1s.
Solution? Compress the index to reduce I/O pressure.
https://blogs.dropbox.com/tech/2016/09/improving-the-performance-of-full-text-search/
September 7, 2016
Data Structures + Data Compression = Faster Algorithms
Elias-Fano - Genesis
Peter Elias [1923 - 2001], Robert Fano [1917 - 2016]
Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).
Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2, 246-260 (1974).
Sebastiano Vigna. Quasi-Succinct Indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).
40 years later!
Elias-Fano - Encoding example
S = 3 4 7 13 14 15 21 43, with n = 8 and universe u = 43.
Write each integer in binary using ⌈lg u⌉ = 6 bits:
000011 000100 000111 001101 001110 001111 010101 101011
Split each codeword into a high part (the first lg n = 3 bits) and a low part (the remaining lg(u/n) = 3 bits).
Concatenate the low parts into the bitvector
L = 011100111101110111101011
The high parts (0 0 0 1 1 1 2 5) say how many integers fall into each of the 2^3 = 8 high-bit buckets: 3 3 1 0 0 1 0 0. These counters, concatenated in negated unary, form the bitvector
H = 1110 1110 10 0 0 10 0 0
Elias-Fano - Properties
EF(S[0,n)) = n⌈lg(u/n)⌉ + 2n bits.
Let X be the set of all monotone sequences of length n drawn from a universe u: |X| = (u+n choose n), so the information-theoretic minimum is lg(u+n choose n) ≈ n lg((u+n)/n) bits. The Elias-Fano bound is less than half a bit per element away from it [Elias, JACM 1974].
• access to each S[i] in O(1) worst-case
• predecessor(x) = max{S[i] | S[i] < x} and successor(x) = min{S[i] | S[i] ≥ x} queries in O(lg(u/n)) worst-case
but… we need o(n) more bits to support fast rank/select primitives on the bitvector H.
Clustered Elias-Fano Indexes
Every encoder represents each sequence individually. No exploitation of redundancy.
Idea: encode clusters of posting lists.
Clustered Elias-Fano Indexes
Each cluster of posting lists is encoded with respect to a reference list R: instead of spending lg u bits per element over the unbounded universe, we spend lg |R| bits, with |R| << u.
Problems:
1. how to build clusters
2. how to synthesise the reference list
NP-hard already for a simplified formulation.
Clustered Elias-Fano Indexes
Time VS space trade-offs by varying the reference size:
• always better in space than PEF (by up to 11%) and than BIC (by up to 6.25%)
• much faster than BIC (103% on average), slightly slower than PEF (20% on average)
Integer Data Structures
Elias-Fano matches the information-theoretic minimum: n lg(u/n) + 2n + o(n) bits, with
• O(1) random access
• O(lg(u/n)) predecessor/successor
but it is a static succinct data structure: NO dynamic updates.
Dynamic integer data structures:
• vEB Trees [van Emde Boas, FOCS 1975]
• x/y-Fast Tries [Willard, IPL 1983]
• Fusion Trees [Fredman and Willard, JCSS 1993]
• Exponential Search Trees [Andersson and Thorup, JACM 2007]
• …
Most of them take optimal time, but need O(n lg u) bits (or even worse).
Integer Data Structures - Problems and Results
The (general) Dictionary problem
The dynamic dictionary problem consists in representing a set S of n objects so that the following operations are supported:
• insert(x) inserts x in S
• delete(x) deletes x from S
• search(x) checks whether x belongs to S
• minimum() returns the minimum element of S
• maximum() returns the maximum element of S
• predecessor(x) returns max{y ∈ S : y < x}
• successor(x) returns min{y ∈ S : y ≥ x}
The Dynamic List Representation problem [Fredman and Saks, STOC 1989]
Given a list S of n sorted integers, support the following operations
• access(i) returns the i-th smallest element of S
• insert(x) inserts x in S
• delete(x) deletes x from S
under the assumption that w ≤ lg^γ n for some γ.
Lower bound: Ω(lg n / lg lg n) amortized time per operation, in the cell-probe computational model.