Space- and Time-Efficient Data Structures for Massive Datasets


Giulio Ermanno Pibiri (giulio.pibiri@di.unipi.it)
Supervisor: Rossano Venturini
Department of Computer Science, University of Pisa

Evidence

"Software is getting slower more rapidly than hardware becomes faster." (Wirth's law)

The increase of information does not scale with technology. Even more relevant today!

Scenario

Two resources matter: time and space.

Algorithms are about EFFICIENCY: how much work is required by a program (less work).
Data structures are about PERFORMANCE: how quickly a program does its work (faster work).

And where does data compression fit in this picture?

Small vs. fast?

The dichotomy problem: choose one. NO.

High-level thesis

Data Structures + Data Compression + Fast Algorithms.

Design space-efficient ad-hoc data structures, both from a theoretical and a practical perspective, that support fast data extraction: Data Compression & Fast Retrieval together.

Achieved results

Clustered Elias-Fano Indexes
Giulio Ermanno Pibiri and Rossano Venturini
Journal paper: ACM Transactions on Information Systems (TOIS), full paper, 34 pages, 2017.

Dynamic Elias-Fano Representation
Giulio Ermanno Pibiri and Rossano Venturini
Conference paper: Annual Symposium on Combinatorial Pattern Matching (CPM), full paper, 14 pages, 2017.

Efficient Data Structures for Massive N-Gram Datasets
Giulio Ermanno Pibiri and Rossano Venturini
Conference paper: ACM Conference on Research and Development in Information Retrieval (SIGIR), full paper, 10 pages, 2017.

Variable-Byte Encoding is Now Space-Efficient Too
Giulio Ermanno Pibiri and Rossano Venturini
Journal paper: arXiv (CoRR), April 2018; submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE), full paper, 12 pages, 2018.

Handling Massive N-Gram Datasets Efficiently
Giulio Ermanno Pibiri and Rossano Venturini
Journal paper.

Fast Dictionary-based Compression for Inverted Indexes
Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat
Conference paper: ACM Conference on Web Search and Data Mining (WSDM), full paper, 9 pages, 2019.

Theme highlighted on the slides: integer sequences.

Problem 1

Consider a sorted integer sequence. How can we represent it as a bit-vector in which each original integer is uniquely decodable, using as few bits as possible? And how do we maintain fast decompression speed?

This is a difficult problem that has been studied since the 1960s.
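
A standard preliminary step in this literature (not stated on the slides, but shared by essentially all the encoders discussed next) is to replace each integer with its gap from the predecessor, so the values to be encoded become small. A minimal Python sketch, assuming a strictly increasing sequence:

    def to_gaps(seq):
        # Delta-encode a strictly increasing sequence: store each value's
        # distance from its predecessor (the first value is kept as-is).
        gaps, prev = [], 0
        for x in seq:
            gaps.append(x - prev)
            prev = x
        return gaps

    def from_gaps(gaps):
        # Invert to_gaps with a prefix sum.
        seq, total = [], 0
        for g in gaps:
            total += g
            seq.append(total)
        return seq

    assert from_gaps(to_gaps([3, 4, 7, 13])) == [3, 4, 7, 13]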

Applications

Inverted indexes, databases, RDF indexing, geo-spatial data, graph compression, e-commerce.

Inverted indexes

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

[Figure: a toy collection of five documents, with IDs 1-5, over the vocabulary {always, boy, good, house, hungry, is, red, the}, whose terms are assigned the IDs t1, ..., t8.]

Each term ti gets an inverted list Lti holding the IDs of the documents containing it:

Lt1 = [1, 3]
Lt2 = [4, 5]
Lt3 = [1]
Lt4 = [2, 3]
Lt5 = [3, 5]
Lt6 = [1, 2, 3, 4, 5]
Lt7 = [1, 2, 4]
Lt8 = [2, 3, 5]

Inverted indexes owe their popularity to the efficient resolution of queries such as: "return all documents in which terms {t1,…,tk} occur".

Example: Q = {boy, is, the} selects the lists Lt2 = [4, 5], Lt6 = [1, 2, 3, 4, 5] and Lt8 = [2, 3, 5], whose intersection is the single document 5 (see the sketch below).
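
A minimal sketch of how such a query can be resolved by intersecting the selected posting lists. Illustrative only: real systems use skip pointers or galloping search rather than hash sets.

    def intersect(lists):
        # Process lists from shortest to longest: the candidate set can
        # only shrink, so starting small keeps every pass cheap.
        lists = sorted(lists, key=len)
        candidates = lists[0]
        for lst in lists[1:]:
            members = set(lst)
            candidates = [doc for doc in candidates if doc in members]
        return candidates

    index = {"boy": [4, 5], "is": [1, 2, 3, 4, 5], "the": [2, 3, 5]}
    print(intersect([index[t] for t in ("boy", "is", "the")]))  # [5]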

Many solutions

A huge research corpus describes different space/time trade-offs, from the '70s to 2014: Elias Gamma and Delta, Variable-Byte Family, Simple Family, PForDelta, Binary Interpolative Coding, QMX, Elias-Fano, Partitioned Elias-Fano.

[Figure: the space/time spectrum, with Binary Interpolative Coding at the most space-efficient end and the Variable-Byte Family at the most time-efficient end.]

Key research questions

On this spectrum, Binary Interpolative Coding (BIC) is ~3X smaller, while the Variable-Byte Family (VByte) is ~4.5X faster.

1. Is it possible to design an encoding that is as small as BIC and much faster?
2. Is it possible to design an encoding that is as fast as VByte and much smaller?

What about both objectives at the same time?!

Idea 1 - Clustered inverted indexes (TOIS ’17)

Every encoder represents each sequence individually: no exploitation of the redundancy shared across lists. Instead, encode clusters of inverted lists (see the sketch below).

Space: always better than PEF (by up to 11%). Time: much faster than BIC (~103%).
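
A deliberately simplified sketch of the underlying intuition: lists in a cluster are rewritten as positions into a shared reference sequence, so each list lives in a much smaller universe and compresses better. The actual TOIS ’17 construction (how clusters and reference sequences are chosen and encoded) is more involved; build_reference and map_to_ranks are names of mine.

    import bisect

    def build_reference(cluster):
        # The cluster's reference sequence: sorted union of its docIDs.
        return sorted(set().union(*cluster))

    def map_to_ranks(lst, reference):
        # Rewrite a list's docIDs as ranks in the reference sequence.
        # Ranks range over len(reference), not the whole docID space.
        return [bisect.bisect_left(reference, x) for x in lst]

    cluster = [[3, 9, 120], [9, 120, 4000], [3, 4000]]
    ref = build_reference(cluster)                   # [3, 9, 120, 4000]
    print([map_to_ranks(l, ref) for l in cluster])   # [[0,1,2], [1,2,3], [0,3]]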

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small, indeed), yet VByte needs at least 8 bits per integer, which is considerably far from bit-level effectiveness (BIC: 3.54, PEF: 4.1 bits per integer on Gov2).

Encode dense regions with unary codes and sparse regions with VByte, with the optimal partitioning computed in linear time. This preserves VByte's query processing speed and sequential decoding while improving the compression ratio. (A sketch of the basic VByte code follows.)
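
For reference, a minimal sketch of plain VByte, the baseline the optimal partitioning improves on, using one common convention (7 payload bits per byte, high bit set on the last byte of each value; variants differ):

    def vbyte_encode(values):
        # 7 payload bits per byte; the high bit marks a value's last byte.
        out = bytearray()
        for v in values:
            while v >= 128:
                out.append(v & 0x7F)   # low 7 bits, continuation byte
                v >>= 7
            out.append(v | 0x80)       # final byte: stop bit set
        return bytes(out)

    def vbyte_decode(data):
        values, v, shift = [], 0, 0
        for b in data:
            v |= (b & 0x7F) << shift
            if b & 0x80:               # stop bit: value complete
                values.append(v)
                v, shift = 0, 0
            else:
                shift += 7
        return values

    assert vbyte_decode(vbyte_encode([5, 130, 20000])) == [5, 130, 20000]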

Idea 3 - Dictionary compression (WSDM ’19)
with M. Petri and A. Moffat (University of Melbourne)

If we consider subsequences of d-gaps in inverted lists, these are repetitive across the whole inverted index. Put the top-k frequent patterns in a dictionary of size k, then encode the inverted lists as sequences of log k-bit codewords.
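
A toy sketch of the idea under simplifying assumptions of mine (fixed-length patterns, no escape mechanism for patterns missing from the dictionary; the WSDM ’19 scheme handles variable-length patterns and exceptions):

    from collections import Counter

    def top_k_patterns(lists_of_gaps, pattern_len, k):
        # Count fixed-length d-gap patterns across all lists and keep the
        # k most frequent ones as the dictionary.
        counts = Counter()
        for gaps in lists_of_gaps:
            for i in range(0, len(gaps) - pattern_len + 1, pattern_len):
                counts[tuple(gaps[i:i + pattern_len])] += 1
        return [p for p, _ in counts.most_common(k)]

    def encode(gaps, dictionary, pattern_len):
        # Each pattern becomes an index into the dictionary: a codeword
        # of ceil(log2 k) bits.
        return [dictionary.index(tuple(gaps[i:i + pattern_len]))
                for i in range(0, len(gaps), pattern_len)]

    lists = [[1, 1, 2, 1], [1, 1, 1, 1]]
    d = top_k_patterns(lists, pattern_len=2, k=4)    # [(1, 1), (2, 1)]
    print(encode([1, 1, 2, 1], d, pattern_len=2))    # [0, 1]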

The bigger picture

[Figure-only slides.]

Problem 2 - Integer data structures

Dynamic integer data structures (van Emde Boas Trees, X/Y-Fast Tries, Fusion Trees, Exponential Search Trees) are dynamic and time-efficient, but not space-efficient.

The Elias-Fano encoding takes EF(S(n,u)) = n⌈log(u/n)⌉ + 2n bits to encode a sorted integer sequence S, supporting O(1) Access and O(1 + log(u/n)) Predecessor: space-efficient, but static.

Can we grab the best from both?
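
A minimal sketch of the static Elias-Fano encoding behind these numbers. Each value is split into a low part of l ≈ log(u/n) bits, stored verbatim, and a high part, stored as a unary-coded bucket walk. Access here is a linear scan for clarity; a real implementation answers it in O(1) with a select structure over the high bits.

    import math

    def ef_encode(seq, u):
        # Low parts: l = floor(log2(u/n)) bits each, stored verbatim.
        # High parts: for each value, walk to its bucket (0-bits) and
        # emit a 1-bit; at most 2n bits overall.
        n = len(seq)
        l = max(0, int(math.floor(math.log2(u / n))))
        low = [x & ((1 << l) - 1) for x in seq]
        high, prev_bucket = [], 0
        for x in seq:
            bucket = x >> l
            high.extend([0] * (bucket - prev_bucket))
            high.append(1)
            prev_bucket = bucket
        return l, low, high

    def access(i, l, low, high):
        # Find the (i+1)-th 1-bit; the 0-bits before it give the bucket.
        ones = zeros = 0
        for bit in high:
            if bit:
                if ones == i:
                    return (zeros << l) | low[i]
                ones += 1
            else:
                zeros += 1
        raise IndexError(i)

    l, low, high = ef_encode([3, 4, 7, 13, 14, 15, 21, 43], u=64)
    print([access(i, l, low, high) for i in range(8)])  # the original sequence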

Dynamic inverted indexes

Classic solution: use two indexes. One is big and cold; the other is small and hot. Merge them periodically. This calls for append-only inverted indexes (a minimal sketch follows).
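
A minimal sketch of that classic scheme (class and threshold names are mine), assuming docIDs arrive in increasing order so the hot index stays append-only:

    class TwoLevelIndex:
        def __init__(self, merge_threshold=1_000_000):
            self.cold = {}   # big, immutable between merges
            self.hot = {}    # small, append-only
            self.hot_postings = 0
            self.merge_threshold = merge_threshold

        def add(self, term, doc_id):
            # docIDs are assumed to arrive in increasing order.
            self.hot.setdefault(term, []).append(doc_id)
            self.hot_postings += 1
            if self.hot_postings >= self.merge_threshold:
                self._merge()

        def _merge(self):
            # Periodic merge: drain the hot index into the cold one.
            for term, postings in self.hot.items():
                self.cold.setdefault(term, []).extend(postings)
            self.hot, self.hot_postings = {}, 0

        def postings(self, term):
            return self.cold.get(term, []) + self.hot.get(term, [])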

Integer dictionaries in succinct space (CPM ’17)

For u = n^γ, γ = Θ(1):

Result 1: EF(S(n,u)) + o(n) bits; O(1) Access; O(min{1 + log(u/n), log log n}) Predecessor.

Result 2: EF(S(n,u)) + o(n) bits; O(1) Access; O(1) Append (amortized); O(min{1 + log(u/n), log log n}) Predecessor.

Result 3: EF(S(n,u)) + o(n) bits; O(log n / log log n) Access; O(log n / log log n) Insert/Delete (amortized).

Optimal time bounds for all operations, using a sublinear redundancy.

Problem 3

Consider a large text. How can we represent all its substrings of 1 ≤ k ≤ N words, for fixed N (e.g., N = 5), using as few bits as possible? How can we estimate the probability of occurrence of the patterns under a given probability model? And how do we support fast Access to individual N-grams?

Applications

Next-word prediction: given a context such as "space and time-efficient ?", rank the candidate continuations by their frequency counts.

[Figure: candidate continuations (e.g., "data structures", "algorithms") with their frequency counts.]

P("data structures" | "space and time-efficient") ≈ f("space and time-efficient data structures") / f("space and time-efficient")
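
A toy sketch of this estimate with plain relative frequencies. The model actually used in the thesis, modified Kneser-Ney, additionally applies discounting and backoff to shorter contexts, which is omitted here.

    from collections import Counter

    def ngram_counts(words, max_n=5):
        # Count all n-grams of order 1..max_n in a tokenized text.
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    def next_word_prob(counts, context, word):
        # Plain relative-frequency estimate f(context + word) / f(context).
        joint = counts[tuple(context) + (word,)]
        denom = counts[tuple(context)]
        return joint / denom if denom else 0.0

    text = "space and time efficient data structures for massive data".split()
    c = ngram_counts(text)
    print(next_word_prob(c, ("data",), "structures"))  # 0.5: "data" occurs twice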

[Figure: voice assistants ("What can I help you with?", Siri) as an example application.]

Indexing Books

~6% of the books ever published:

n    number of n-grams
1        24,359,473
2       667,284,771
3     7,397,041,901
4     1,644,807,896
5     1,415,355,596

More than 11 billion n-grams in total.

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small. Map a word ID to the position it takes within its sibling IDs (the IDs following a context of fixed length k), e.g., k = 1 (see the sketch below).

The (Elias-Fano) context-based remapped trie is as fast as the fastest competitors and even smaller than the most space-efficient ones.
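
A minimal sketch of the remapping step itself (function names are mine): given the sorted set of word IDs that can follow a context, store each word's rank in that set instead of the word ID itself, so the stored values are bounded by the number of distinct successors.

    import bisect

    def remap(successors, word_id):
        # Rank of word_id among the sorted IDs following the context.
        return bisect.bisect_left(successors, word_id)

    def unmap(successors, rank):
        return successors[rank]

    # Hypothetical context: only four distinct words ever follow it.
    successors = [7, 19, 804, 90213]      # sorted word IDs
    r = remap(successors, 90213)          # 3: fits in 2 bits, not ~17
    assert unmap(successors, r) == 90213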

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams, the fastest algorithm in the literature uses 3 sorting steps in external memory.

Going from suffix order to context order: compute the distinct left extensions using a scan of the block, then rebuild the last level of the trie.

[Figure: per-symbol counts (A: 4, B: 2, C: 2, X: 4) and the corresponding starting offsets (A: 1, B: 5, C: 7, X: 9), used to rebuild the last trie level in context order; see the sketch below.]

Estimation runs 4.5X faster.
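
The counts-and-offsets pair on the slide reads like the classic counting-sort step: prefix sums over per-symbol counts give each symbol's starting offset, after which a single scan suffices to permute records into context order. A sketch under that reading (my interpretation of the figure):

    def counting_offsets(counts):
        # Exclusive prefix sums, 1-based: counts A=4, B=2, C=2, X=4
        # become starting offsets A=1, B=5, C=7, X=9.
        offsets = [1]
        for c in counts[:-1]:
            offsets.append(offsets[-1] + c)
        return offsets

    print(counting_offsets([4, 2, 2, 4]))  # [1, 5, 7, 9]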

Thanks for your attention, time, and patience! Any questions?
