
(1)

Space and Time-Efficient Data Structures

for Massive Datasets

Giulio Ermanno Pibiri
giulio.pibiri@di.unipi.it

Supervisor: Rossano Venturini

Computer Science Department, University of Pisa

(2)

High-Level Thesis

Data Structures + Data Compression → Faster Algorithms

Design space-efficient ad-hoc data structures, from both a theoretical and a practical perspective, that support fast data extraction: data compression and fast retrieval, together.

(3)

Published Results

1. Clustered Elias-Fano Indexes
   Journal paper: Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS), 2017.

2. Dynamic Elias-Fano Representation
   Conference paper: Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM), 2017.

3. Efficient Data Structures for Massive N-Gram Datasets
   Conference paper: Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017.

EVERYTHING that I do (papers, slides and code) is fully accessible at my page: http://pages.di.unipi.it/pibiri/

(8)

Inverted Indexes

Inverted indexes owe their popularity to the efficient resolution of queries such as: “return all documents in which the terms {t_1, …, t_k} occur”.

Example: a collection of 5 documents over the term dictionary T = {always, boy, good, house, hungry, is, red, the} = {t1, …, t8} yields one sorted posting list per term:

Lt1 = [1, 3]
Lt2 = [4, 5]
Lt3 = [1]
Lt4 = [2, 3]
Lt5 = [3, 5]
Lt6 = [1, 2, 3, 4, 5]
Lt7 = [1, 2, 4]
Lt8 = [2, 3, 5]

The query q = {boy, is, the} is resolved by intersecting the lists of its terms: Lt2 ∩ Lt6 ∩ Lt8 = [4, 5] ∩ [1, 2, 3, 4, 5] ∩ [2, 3, 5] = [5].
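A minimal C++ sketch of this resolution step (my illustration, not code from the thesis): a conjunctive query boils down to intersecting the k sorted posting lists, here pairwise with std::set_intersection.

```cpp
// Minimal sketch: resolve q = {boy, is, the} by intersecting sorted posting lists.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

using PostingList = std::vector<int>; // sorted, strictly increasing doc IDs

PostingList intersect(std::vector<PostingList> lists) {
    if (lists.empty()) return {};
    PostingList result = std::move(lists[0]);
    for (size_t i = 1; i < lists.size() && !result.empty(); ++i) {
        PostingList tmp;
        std::set_intersection(result.begin(), result.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(tmp));
        result.swap(tmp);
    }
    return result;
}

int main() {
    // L_t2 (boy), L_t6 (is), L_t8 (the) from the example above
    for (int doc : intersect({{4, 5}, {1, 2, 3, 4, 5}, {2, 3, 5}}))
        std::cout << doc << '\n'; // prints 5
}
```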

(12)

Clustered Elias-Fano Indexes - TOIS’17

Every encoder represents each sequence individually: no redundancy across sequences is exploited.

Idea: encode clusters of posting lists.

(14)

Clustered Elias-Fano Indexes - TOIS’17

Take a cluster of posting lists and synthesise a reference list R for the cluster. Postings can then be encoded as positions within R: log |R| bits each instead of log u bits, with |R| << u.

Problems
1. Build the clusters.
2. Synthesise the reference list: an NP-hard problem already for a simplified formulation.
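A sketch of the remapping this enables (my illustration, under the assumption that every posting of the cluster occurs in R): each posting is rewritten as its position inside the sorted reference list, so it fits in log |R| bits.

```cpp
// Minimal sketch: rewrite a posting list as positions in the reference list R.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> remap(const std::vector<uint32_t>& list, // sorted postings
                            const std::vector<uint32_t>& R) {  // sorted reference
    std::vector<uint32_t> positions;
    positions.reserve(list.size());
    for (uint32_t x : list) {
        // assumes every posting of the cluster occurs in R
        auto it = std::lower_bound(R.begin(), R.end(), x);
        positions.push_back(uint32_t(it - R.begin())); // value in [0, |R|)
    }
    return positions;
}
```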


(21)

Clustered Elias-Fano Indexes - TOIS’17

Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
Time: much faster than BIC (103% on average), slightly slower than PEF (20% on average).

(24)

(Integer) Dynamic Ordered Sets

A dynamic ordered set S is a data structure representing n keys and supporting the following operations:

Insert(x) inserts x in S
Delete(x) deletes x from S
Search(x) checks whether x belongs to S
Minimum() returns the minimum element of S
Maximum() returns the maximum element of S
Predecessor(x) returns max{y ∈ S : y < x}
Successor(x) returns min{y ∈ S : y ≥ x}

In the comparison model this is solved optimally by any self-balancing tree data structure in O(log n) time and O(n) space. More efficient solutions exist if the considered keys are integers drawn from a bounded universe of size u.

Challenge: how to optimally solve the integer dynamic ordered set problem in compressed space?
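For concreteness, a minimal sketch of these operations on top of std::set, C++'s standard self-balancing tree; every operation runs in O(log n) time, matching the comparison-model bound.

```cpp
// Dynamic ordered set operations via std::set (a self-balancing tree).
#include <cstdint>
#include <iostream>
#include <optional>
#include <set>

std::optional<uint64_t> predecessor(const std::set<uint64_t>& S, uint64_t x) {
    auto it = S.lower_bound(x);                // first y >= x
    if (it == S.begin()) return std::nullopt;  // no y < x
    return *std::prev(it);                     // max{y in S : y < x}
}

std::optional<uint64_t> successor(const std::set<uint64_t>& S, uint64_t x) {
    auto it = S.lower_bound(x);                // min{y in S : y >= x}
    if (it == S.end()) return std::nullopt;
    return *it;
}

int main() {
    std::set<uint64_t> S;
    S.insert(3); S.insert(7); S.insert(10);     // Insert
    S.erase(7);                                 // Delete
    bool found = S.count(3) > 0;                // Search
    uint64_t mn = *S.begin(), mx = *S.rbegin(); // Minimum, Maximum
    std::cout << found << ' ' << mn << ' ' << mx << ' '
              << *predecessor(S, 10) << ' ' << *successor(S, 4) << '\n';
    // prints: 1 3 10 3 10
}
```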

(27)

Motivation

Integer data structures (van Emde Boas Trees, X/Y-Fast Tries, Fusion Trees, Exponential Search Trees): dynamic and fast, but space-inefficient.

Elias-Fano Encoding: EF(S(n,u)) = n log(u/n) + 2n bits to encode an ordered integer sequence S, with O(1) Access and O(1 + log(u/n)) Predecessor. Space-efficient, but static.

Can we grab the best from both?
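A compact sketch of the static Elias-Fano encoding referenced above (illustrative only: the low bits are left unpacked and Access uses a linear select scan, whereas real implementations pack l-bit cells and answer select in O(1)).

```cpp
// Elias-Fano: split each value into l = floor(log2(u/n)) low bits, stored
// verbatim, and a high part, stored in unary in a bit vector of ~n + u/2^l bits.
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct EliasFano {
    uint64_t n, l;
    std::vector<bool> high;    // unary-coded high parts
    std::vector<uint64_t> low; // l low bits per value (unpacked for clarity)

    EliasFano(const std::vector<uint64_t>& s, uint64_t u) : n(s.size()) {
        assert(n > 0);
        l = (u > n) ? uint64_t(std::floor(std::log2(double(u) / double(n)))) : 0;
        high.assign(n + (u >> l) + 1, false);
        for (uint64_t i = 0; i < n; ++i) {
            assert(i == 0 || s[i] >= s[i - 1]);   // input must be sorted
            low.push_back(s[i] & ((uint64_t(1) << l) - 1));
            high[(s[i] >> l) + i] = true;         // i-th one sits at high_part + i
        }
    }

    // Access(i): find the i-th one in `high` (linear select here), then glue
    // the high part (position - i) back together with the stored low bits.
    uint64_t access(uint64_t i) const {
        uint64_t ones = 0;
        for (uint64_t pos = 0; pos < high.size(); ++pos)
            if (high[pos] && ones++ == i) return ((pos - i) << l) | low[i];
        return uint64_t(-1); // unreachable for valid i
    }
};
```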

(30)

Dynamic Elias-Fano Representation - CPM’17

For u = n^γ, γ = Θ(1):

Result 1
EF(S(n,u)) + o(n) bits
O(1) Access
O(min{1 + log(u/n), loglog n}) Predecessor

Result 2
EF(S(n,u)) + o(n) bits
O(1) Access
O(1) Append (amortized)
O(min{1 + log(u/n), loglog n}) Predecessor

Result 3
EF(S(n,u)) + o(n) bits
O(log n / loglog n) Access
O(log n / loglog n) Insert/Delete (amortized)
O(min{1 + log(u/n), loglog n}) Predecessor

(35)

Dynamic Elias-Fano Representation - CPM’17

The sequence S, encoded with EF(S(n,u)) = n log(u/n) + 2n bits, is divided into mini blocks of size b = log n / loglog n, which are in turn grouped into blocks of log² n mini blocks.

Lower level: each mini block is handled by T, a k-ary tree of constant height:
• O(loglog n) time
• O(log² n loglog n) bits per block, o(n) bits overall

Upper level:
• Y is a y-fast trie: O(loglog n) time
• P is a dynamic prefix-sums data structure: O(b) time
• O(n / (b × log² n)) × O(log u) = o(n) bits each

+ some technicalities
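As a sanity check on the upper-level space bound, using b = log n / loglog n and the standing assumption u = n^γ with γ = Θ(1), so that log u = Θ(log n):

```latex
O\!\left(\frac{n}{b \log^2 n}\right) \cdot O(\log u)
  = O\!\left(\frac{n \log\log n}{\log^3 n}\right) \cdot O(\log n)
  = O\!\left(\frac{n \log\log n}{\log^2 n}\right)
  = o(n) \ \text{bits}.
```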

(47)

N-grams

Strings of N words, with N typically ranging from 1 to 5, extracted from text using a sliding-window approach.

Example: a Books collection covering ≈ 6% of the books ever published:

N    number of grams
1         24,359,473
2        667,284,771
3      7,397,041,901
4      1,644,807,896
5      1,415,355,596

More than 11 billion grams in total.
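A small sketch of the sliding-window extraction (my illustration): every window of N consecutive words is emitted as one N-gram.

```cpp
// Extract all N-grams from a text with a window that slides one word at a time.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> extract_ngrams(const std::string& text, size_t N) {
    std::vector<std::string> words, grams;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);
    for (size_t i = 0; N > 0 && i + N <= words.size(); ++i) {
        std::string g = words[i];
        for (size_t j = 1; j < N; ++j) g += " " + words[i + j];
        grams.push_back(g);
    }
    return grams;
}

int main() {
    for (const auto& g : extract_ngrams("the boy is hungry", 2))
        std::cout << g << '\n'; // "the boy", "boy is", "is hungry"
}
```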

(51)

Compressed Tries with Context-based ID Remapping - SIGIR’17

High-level idea: map a word ID to the position it takes within its sibling IDs (the IDs following a context of fixed length k), e.g., k = 1.

Observation: the number of words following a given context is small.

• Millions of unigrams.
• Height 5: longer contexts.
• The number of siblings has a funnel-shaped distribution.

[Figure: u/n by varying the context length k.]
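A minimal sketch of the remapping for k = 1 (hypothetical helper names, not the SIGIR’17 code): a word ID w seen after context c becomes its rank among the distinct IDs following c, which live in a far smaller universe than the full vocabulary and therefore compress better.

```cpp
// Context-based ID remapping for context length k = 1.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using WordId = uint32_t;

struct Remapper {
    // successors[c] = sorted, distinct word IDs observed after context word c
    std::unordered_map<WordId, std::vector<WordId>> successors;

    void build(const std::vector<std::pair<WordId, WordId>>& bigrams) {
        for (auto [c, w] : bigrams) successors[c].push_back(w);
        for (auto& [c, s] : successors) {
            std::sort(s.begin(), s.end());
            s.erase(std::unique(s.begin(), s.end()), s.end());
        }
    }

    // Remapped ID: the position of w among its siblings after context c.
    uint32_t remap(WordId c, WordId w) const {
        const auto& s = successors.at(c);
        return uint32_t(std::lower_bound(s.begin(), s.end(), w) - s.begin());
    }
};
```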

(62)

Compressed Tries with Context-based ID Remapping - SIGIR’17

Test machine: Intel Xeon E5-2630 v3, 2.4 GHz, 193 GB of RAM, 64-bit Linux. C++ implementation compiled with gcc 5.4.1 at the highest optimization setting.

Context-based ID Remapping:
• reduces space by more than 36% on average (you will notice this!)
• costs approximately 30% more time (will you notice this?)

(69)

Compressed Tries with Context-based ID Remapping - SIGIR’17

[Figures: space and time comparison against previous solutions, with improvement factors ranging from 2X to 5.8X.]

• Elias-Fano tries substantially outperform ALL previous solutions in both space and time.
• As fast as the state of the art (KenLM), but more than twice as small.

(76)

Ongoing work (preliminary results)

Scalable Modified Kneser-Ney Language Model Estimation

[Figures: running times in seconds of the counting, normalization and interpolation phases of Tongrams with 1, 2, 4, 8 and 16 threads, against KenLM, on two datasets.]

N        Dataset 1            Dataset 2
1            1,255,027           15,039,323
2           20,431,391           44,033,774
3           82,815,629          142,894,817
4          153,984,231          280,714,113
5          196,779,246          381,284,741
total      455,265,524          863,966,768

Dataset 1: 1.3 GB, 233,035,325 total words. Tongrams is 3.89X faster than KenLM.
Dataset 2: 3.2 GB, 495,527,349 total words. Tongrams is 6.22X faster than KenLM.

(81)

Planned work for this year

1. Conclude two journal extensions (TCS, VLDBJ).

2. Develop other research ideas:
• Inverted indexes with false positives allowed.
• Compressed tries based on double-arrays.
• Data structures for feature repositories.

3. Spend 6 months abroad.

(87)

Thanks for your attention, time, patience!

Any questions?
