
(1)

Space and Time-Efficient Data Structures

for Massive Datasets

Giulio Ermanno Pibiri
giulio.pibiri@di.unipi.it

Supervisor: Rossano Venturini

Computer Science Department, University of Pisa

(2)

High-Level Thesis

Data Structures + Data Compression → Faster Algorithms

Design space-efficient ad-hoc data structures, from both a theoretical and a practical perspective, that support fast data extraction: data compression and fast retrieval, together.

(3)

Published Results

1. Clustered Elias-Fano Indexes
   Journal paper: Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS), 2017.

2. Dynamic Elias-Fano Representation
   Conference paper: Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM), 2017.

3. Efficient Data Structures for Massive N-Gram Datasets
   Conference paper: Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017.

EVERYTHING that I do (papers, slides and code) is fully accessible at my page: http://pages.di.unipi.it/pibiri/

(8)

Inverted Indexes

Inverted indexes owe their popularity to the efficient resolution of queries such as: “return all documents in which the terms {t_1, …, t_k} occur”.

Example: a collection of 5 documents over the term dictionary T = {always, boy, good, house, hungry, is, red, the} = {t1, …, t8} yields one sorted posting list per term:

Lt1 = [1, 3]
Lt2 = [4, 5]
Lt3 = [1]
Lt4 = [2, 3]
Lt5 = [3, 5]
Lt6 = [1, 2, 3, 4, 5]
Lt7 = [1, 2, 4]
Lt8 = [2, 3, 5]

The query q = {boy, is, the} is resolved by intersecting the lists of its terms: Lt2 ∩ Lt6 ∩ Lt8 = [4, 5] ∩ [1, 2, 3, 4, 5] ∩ [2, 3, 5] = [5].
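A minimal C++ sketch of this resolution step (my illustration, not code from the thesis): a conjunctive query boils down to intersecting the k sorted posting lists, here pairwise with std::set_intersection.

```cpp
// Minimal sketch: resolve q = {boy, is, the} by intersecting sorted posting lists.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

using PostingList = std::vector<int>; // sorted, strictly increasing doc IDs

PostingList intersect(std::vector<PostingList> lists) {
    if (lists.empty()) return {};
    PostingList result = std::move(lists[0]);
    for (size_t i = 1; i < lists.size() && !result.empty(); ++i) {
        PostingList tmp;
        std::set_intersection(result.begin(), result.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(tmp));
        result.swap(tmp);
    }
    return result;
}

int main() {
    // L_t2 (boy), L_t6 (is), L_t8 (the) from the example above
    for (int doc : intersect({{4, 5}, {1, 2, 3, 4, 5}, {2, 3, 5}}))
        std::cout << doc << '\n'; // prints 5
}
```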

(12)

Clustered Elias-Fano Indexes - TOIS’17

Every encoder represents each sequence individually: no redundancy across sequences is exploited.

Idea: encode clusters of posting lists.

(14)

Clustered Elias-Fano Indexes - TOIS’17

Take a cluster of posting lists and synthesise a reference list R for the cluster. Postings can then be encoded as positions within R: log |R| bits each instead of log u bits, with |R| << u.

Problems
1. Build the clusters.
2. Synthesise the reference list: an NP-hard problem already for a simplified formulation.
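A sketch of the remapping this enables (my illustration, under the assumption that every posting of the cluster occurs in R): each posting is rewritten as its position inside the sorted reference list, so it fits in log |R| bits.

```cpp
// Minimal sketch: rewrite a posting list as positions in the reference list R.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> remap(const std::vector<uint32_t>& list, // sorted postings
                            const std::vector<uint32_t>& R) {  // sorted reference
    std::vector<uint32_t> positions;
    positions.reserve(list.size());
    for (uint32_t x : list) {
        // assumes every posting of the cluster occurs in R
        auto it = std::lower_bound(R.begin(), R.end(), x);
        positions.push_back(uint32_t(it - R.begin())); // value in [0, |R|)
    }
    return positions;
}
```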


(21)

Clustered Elias-Fano Indexes - TOIS’17

Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
Time: much faster than BIC (103% on average), slightly slower than PEF (20% on average).

(24)

(Integer) Dynamic Ordered Sets

A dynamic ordered set S is a data structure representing n keys and supporting the following operations:

Insert(x) inserts x in S
Delete(x) deletes x from S
Search(x) checks whether x belongs to S
Minimum() returns the minimum element of S
Maximum() returns the maximum element of S
Predecessor(x) returns max{y ∈ S : y < x}
Successor(x) returns min{y ∈ S : y ≥ x}

In the comparison model this is solved optimally by any self-balancing tree data structure in O(log n) time and O(n) space. More efficient solutions exist if the considered keys are integers drawn from a bounded universe of size u.

Challenge: how to optimally solve the integer dynamic ordered set problem in compressed space?
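For concreteness, a minimal sketch of these operations on top of std::set, C++'s standard self-balancing tree; every operation runs in O(log n) time, matching the comparison-model bound.

```cpp
// Dynamic ordered set operations via std::set (a self-balancing tree).
#include <cstdint>
#include <iostream>
#include <optional>
#include <set>

std::optional<uint64_t> predecessor(const std::set<uint64_t>& S, uint64_t x) {
    auto it = S.lower_bound(x);                // first y >= x
    if (it == S.begin()) return std::nullopt;  // no y < x
    return *std::prev(it);                     // max{y in S : y < x}
}

std::optional<uint64_t> successor(const std::set<uint64_t>& S, uint64_t x) {
    auto it = S.lower_bound(x);                // min{y in S : y >= x}
    if (it == S.end()) return std::nullopt;
    return *it;
}

int main() {
    std::set<uint64_t> S;
    S.insert(3); S.insert(7); S.insert(10);     // Insert
    S.erase(7);                                 // Delete
    bool found = S.count(3) > 0;                // Search
    uint64_t mn = *S.begin(), mx = *S.rbegin(); // Minimum, Maximum
    std::cout << found << ' ' << mn << ' ' << mx << ' '
              << *predecessor(S, 10) << ' ' << *successor(S, 4) << '\n';
    // prints: 1 3 10 3 10
}
```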

(27)

Motivation

Integer data structures (van Emde Boas Trees, X/Y-Fast Tries, Fusion Trees, Exponential Search Trees): dynamic and fast, but space-inefficient.

Elias-Fano Encoding: EF(S(n,u)) = n log(u/n) + 2n bits to encode an ordered integer sequence S, with O(1) Access and O(1 + log(u/n)) Predecessor. Space-efficient, but static.

Can we grab the best from both?
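A compact sketch of the static Elias-Fano encoding referenced above (illustrative only: the low bits are left unpacked and Access uses a linear select scan, whereas real implementations pack l-bit cells and answer select in O(1)).

```cpp
// Elias-Fano: split each value into l = floor(log2(u/n)) low bits, stored
// verbatim, and a high part, stored in unary in a bit vector of ~n + u/2^l bits.
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct EliasFano {
    uint64_t n, l;
    std::vector<bool> high;    // unary-coded high parts
    std::vector<uint64_t> low; // l low bits per value (unpacked for clarity)

    EliasFano(const std::vector<uint64_t>& s, uint64_t u) : n(s.size()) {
        assert(n > 0);
        l = (u > n) ? uint64_t(std::floor(std::log2(double(u) / double(n)))) : 0;
        high.assign(n + (u >> l) + 1, false);
        for (uint64_t i = 0; i < n; ++i) {
            assert(i == 0 || s[i] >= s[i - 1]);   // input must be sorted
            low.push_back(s[i] & ((uint64_t(1) << l) - 1));
            high[(s[i] >> l) + i] = true;         // i-th one sits at high_part + i
        }
    }

    // Access(i): find the i-th one in `high` (linear select here), then glue
    // the high part (position - i) back together with the stored low bits.
    uint64_t access(uint64_t i) const {
        uint64_t ones = 0;
        for (uint64_t pos = 0; pos < high.size(); ++pos)
            if (high[pos] && ones++ == i) return ((pos - i) << l) | low[i];
        return uint64_t(-1); // unreachable for valid i
    }
};
```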

(30)

Dynamic Elias-Fano Representation - CPM’17

For u = n^γ, γ = Θ(1):

Result 1
EF(S(n,u)) + o(n) bits
O(1) Access
O(min{1 + log(u/n), loglog n}) Predecessor

Result 2
EF(S(n,u)) + o(n) bits
O(1) Access
O(1) Append (amortized)
O(min{1 + log(u/n), loglog n}) Predecessor

Result 3
EF(S(n,u)) + o(n) bits
O(log n / loglog n) Access
O(log n / loglog n) Insert/Delete (amortized)
O(min{1 + log(u/n), loglog n}) Predecessor

(35)

Dynamic Elias-Fano Representation - CPM’17

The sequence S, encoded with EF(S(n,u)) = n log(u/n) + 2n bits, is divided into mini blocks of size b = log n / loglog n, which are in turn grouped into blocks of log² n mini blocks.

Lower level: each mini block is handled by T, a k-ary tree of constant height:
• O(loglog n) time
• O(log² n loglog n) bits per block, o(n) bits overall

Upper level:
• Y is a y-fast trie: O(loglog n) time
• P is a dynamic prefix-sums data structure: O(b) time
• O(n / (b × log² n)) × O(log u) = o(n) bits each

+ some technicalities
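As a sanity check on the upper-level space bound, using b = log n / loglog n and the standing assumption u = n^γ with γ = Θ(1), so that log u = Θ(log n):

```latex
O\!\left(\frac{n}{b \log^2 n}\right) \cdot O(\log u)
  = O\!\left(\frac{n \log\log n}{\log^3 n}\right) \cdot O(\log n)
  = O\!\left(\frac{n \log\log n}{\log^2 n}\right)
  = o(n) \ \text{bits}.
```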

(47)

N-grams

Strings of N words, with N typically ranging from 1 to 5, extracted from text using a sliding-window approach.

Example: a Books collection covering ≈ 6% of the books ever published:

N    number of grams
1         24,359,473
2        667,284,771
3      7,397,041,901
4      1,644,807,896
5      1,415,355,596

More than 11 billion grams in total.
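A small sketch of the sliding-window extraction (my illustration): every window of N consecutive words is emitted as one N-gram.

```cpp
// Extract all N-grams from a text with a window that slides one word at a time.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> extract_ngrams(const std::string& text, size_t N) {
    std::vector<std::string> words, grams;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);
    for (size_t i = 0; N > 0 && i + N <= words.size(); ++i) {
        std::string g = words[i];
        for (size_t j = 1; j < N; ++j) g += " " + words[i + j];
        grams.push_back(g);
    }
    return grams;
}

int main() {
    for (const auto& g : extract_ngrams("the boy is hungry", 2))
        std::cout << g << '\n'; // "the boy", "boy is", "is hungry"
}
```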

(51)

Compressed Tries with Context-based ID Remapping - SIGIR’17

High-level idea: map a word ID to the position it takes within its sibling IDs (the IDs following a context of fixed length k), e.g., k = 1.

Observation: the number of words following a given context is small.

• Millions of unigrams.
• Height 5: longer contexts.
• The number of siblings has a funnel-shaped distribution.

[Figure: u/n by varying the context length k.]
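A minimal sketch of the remapping for k = 1 (hypothetical helper names, not the SIGIR’17 code): a word ID w seen after context c becomes its rank among the distinct IDs following c, which live in a far smaller universe than the full vocabulary and therefore compress better.

```cpp
// Context-based ID remapping for context length k = 1.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using WordId = uint32_t;

struct Remapper {
    // successors[c] = sorted, distinct word IDs observed after context word c
    std::unordered_map<WordId, std::vector<WordId>> successors;

    void build(const std::vector<std::pair<WordId, WordId>>& bigrams) {
        for (auto [c, w] : bigrams) successors[c].push_back(w);
        for (auto& [c, s] : successors) {
            std::sort(s.begin(), s.end());
            s.erase(std::unique(s.begin(), s.end()), s.end());
        }
    }

    // Remapped ID: the position of w among its siblings after context c.
    uint32_t remap(WordId c, WordId w) const {
        const auto& s = successors.at(c);
        return uint32_t(std::lower_bound(s.begin(), s.end(), w) - s.begin());
    }
};
```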

(62)

Compressed Tries with Context-based ID Remapping - SIGIR’17

Test machine: Intel Xeon E5-2630 v3, 2.4 GHz, 193 GB of RAM, 64-bit Linux. C++ implementation compiled with gcc 5.4.1 at the highest optimization setting.

Context-based ID Remapping:
• reduces space by more than 36% on average (you will notice this!)
• costs approximately 30% more time (will you notice this?)

(69)

Compressed Tries with Context-based ID Remapping - SIGIR’17

[Figures: space and time comparison against previous solutions, with improvement factors ranging from 2X to 5.8X.]

• Elias-Fano tries substantially outperform ALL previous solutions in both space and time.
• As fast as the state of the art (KenLM), but more than twice as small.

(76)

Ongoing work (preliminary results)

Scalable Modified Kneser-Ney Language Model Estimation

[Figures: running times in seconds of the counting, normalization and interpolation phases of Tongrams with 1, 2, 4, 8 and 16 threads, against KenLM, on two datasets.]

N        Dataset 1            Dataset 2
1            1,255,027           15,039,323
2           20,431,391           44,033,774
3           82,815,629          142,894,817
4          153,984,231          280,714,113
5          196,779,246          381,284,741
total      455,265,524          863,966,768

Dataset 1: 1.3 GB, 233,035,325 total words. Tongrams is 3.89X faster than KenLM.
Dataset 2: 3.2 GB, 495,527,349 total words. Tongrams is 6.22X faster than KenLM.

(81)

Planned work for this year

1. Conclude two journal extensions (TCS, VLDBJ).

2. Develop other research ideas:
• Inverted indexes with false positives allowed.
• Compressed tries based on double-arrays.
• Data structures for feature repositories.

3. Spend 6 months abroad.

(87)

Thanks for your attention, time, patience!

Any questions?
