• Non ci sono risultati.

Compressed Indexes for Fast Search of Semantic Data

N/A
N/A
Protected

Academic year: 2022

Condividi "Compressed Indexes for Fast Search of Semantic Data"

Copied!
38
0
0

Testo completo

(1)

Compressed Indexes for Fast Search
 of Semantic Data

Raffaele Perego

ISTI-CNR
 Pisa, Italy

Giulio Ermanno Pibiri

ISTI-CNR
 Pisa, Italy

Rossano Venturini

The University of Pisa
 Pisa, Italy

The 10-th Italian Information Retrieval Workshop (IIR 2019) 17/09/2019

(2)

Resource Description Framework (RDF)

“RDF is a standard model for data interchange on the Web.”

Source: https://www.w3.org/RDF

Statements are encoded with triples:


Subject (S) - Predicate (P) - Object (O)

(3)

Resource Description Framework (RDF)

“RDF is a standard model for data interchange on the Web.”

Source: https://www.w3.org/RDF

Statements are encoded with triples:


Subject (S) - Predicate (P) - Object (O)

<http://example.name#BobSmith12> <http://xmlns.com/foaf/0.1/knows> <http://example.name#JohnDoe34>

“Bob Smith knows John Doe.”

(4)

The problem

Huge datasets: billions of triples.

Storage space is an issue:


compression is mandatory.

How to support triple selection patterns (with wildcards) efficiently?

(5)

The problem

Huge datasets: billions of triples.

Storage space is an issue:


compression is mandatory.

How to support triple selection patterns (with wildcards) efficiently?

<Bob Smith> <knows> <???>

<???> <???> John Doe

<Bob Smith> <???> <Sara Parker>

(6)

The problem

Huge datasets: billions of triples.

Storage space is an issue:


compression is mandatory.

How to support triple selection patterns (with wildcards) efficiently?

<Bob Smith> <knows> <???>

<???> <???> John Doe

<Bob Smith> <???> <Sara Parker>

1 wildcard:


SP?


S?O


?PO

2 wildcards:


S??


?P?


??O 3 wildcards:


???

0 wildcard:


SPO

(7)

State-of-the-art solutions

Too costly in terms of space.

• Materialize all possible S-P-O permutations (6 separate indexes).


• Do not use sophisticated compression techniques.


• Expensive additional indexes to support retrieval.

(8)

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements:


we deal with datasets of integer triples.

S P O
 S P ?


S ? ?


? ? ?


? P O


? P ? S ? O


? ? O


Selection patterns

(9)

S-P-O order

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements:


we deal with datasets of integer triples.

S P O
 S P ?


S ? ?


? ? ?


? P O


? P ? S ? O


? ? O


Selection patterns

(10)

P-O-S order S-P-O order

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements:


we deal with datasets of integer triples.

S P O
 S P ?


S ? ?


? ? ?


? P O


? P ? S ? O


? ? O


Selection patterns

(11)

O-S-P order P-O-S order S-P-O order

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements:


we deal with datasets of integer triples.

S P O
 S P ?


S ? ?


? ? ?


? P O


? P ? S ? O


? ? O


Selection patterns

(12)

O-S-P order P-O-S order S-P-O order

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements:


we deal with datasets of integer triples.

S P O
 S P ?


S ? ?


? ? ?


? P O


? P ? S ? O


? ? O


Selection patterns

Store an integer trie data structure

for each permutation.

(13)

The Permuted Trie Index: organisation

0 1 2 3 4

(14)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

0 1 2 3 4

(15)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

Allows effective
 compression

0 1 2 3 4

(16)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

Allows effective
 compression

Fast retrieval

0 1 2 3 4

(17)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

Allows effective
 compression

Fast retrieval (1, 2, ?)

0 1 2 3 4

(18)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

Allows effective
 compression

Fast retrieval (1, 2, ?)

0 1 2 3 4

(19)

The Permuted Trie Index: organisation

Common prefixes are encoded once.


Two integer sequences per level (nodes and pointers).


Symmetrically support all selection patterns with 1 and 2 wildcards.


Cache-friendly memory layout.

Allows effective
 compression

Fast retrieval (1, 2, ?)

0 1 2 3 4

(1, 2, 0) (1, 2, 1)

(20)

Permutation Elimination Cross Compression

The Permuted Trie Index: refinements

1

2

(21)

Cross Compression

Fact: the same triple appears three times, but in different permutations.

(22)

Cross Compression

Fact: the same triple appears three times, but in different permutations.

We can represent the subjects in trie 1 by using the subjects in trie 2.

(23)

Cross Compression

Fact: the same triple appears three times, but in different permutations.

We can represent the subjects in trie 1 by using the subjects in trie 2.

P Oi

S1 … Sj Sn

Oi

S1 … Sj Sn

(24)

Cross Compression

Fact: the same triple appears three times, but in different permutations.

We can represent the subjects in trie 1 by using the subjects in trie 2.

P Oi

S1 … Sj Sn

Oi

S1 … Sj Sn

p

Represent Sj as its position p.

(25)

Cross Compression

Fact: the same triple appears three times, but in different permutations.

We can represent the subjects in trie 1 by using the subjects in trie 2.

P Oi

S1 … Sj Sn

Oi

S1 … Sj Sn

p

Represent Sj as its position p.

Why?

Number of children in Dbpedia.

(26)

Permutation Elimination

Fact: predicates are few, thus S?O returns only few matches.

(27)

Permutation Elimination

Fact: predicates are few, thus S?O returns only few matches.

We can pattern match S?O on the SPO trie, instead of the OSP trie.


Given a (s,o) pair: for each child p

i

of s,


check is o is a child of p

i

. If so, then (s,p

i

,o) is a match.

(28)

Permutation Elimination

Fact: predicates are few, thus S?O returns only few matches.

We can pattern match S?O on the SPO trie, instead of the OSP trie.


Given a (s,o) pair: for each child p

i

of s,


check is o is a child of p

i

. If so, then (s,p

i

,o) is a match.

Less than 6 checks are needed on average!

Number of children in Dbpedia.

(29)

Permutation Elimination

S P O
 S P ?


S ? ? S ? O


? ? ? SPO trie

OR

+

(30)

Permutation Elimination

? P O


? ? O

? P ?

Object-based retrieval OPS trie

S P O
 S P ?


S ? ? S ? O


? ? ? SPO trie

OR

+

(31)

Permutation Elimination

? P O


? ? O

? P ?

Object-based retrieval OPS trie

? P O


? ? O

? P ?

Predicate-based retrieval POS trie

S P O
 S P ?


S ? ? S ? O


? ? ? SPO trie

OR

+

(32)

Permutation Elimination

? P O


? ? O

? P ?

Object-based retrieval OPS trie

? P O


? ? O

? P ?

Predicate-based retrieval POS trie

S P O
 S P ?


S ? ? S ? O


? ? ? SPO trie

OR

We can eliminate a permutation, thus saving 1/3 of the space of the index.

+

(33)

Experiments: setting

Compiler

gcc 7.2.0 (with all optimizations) Machine

i7-7700 CPU (@3.6 GHz), 64 GB of RAM DDR3 (@2.133 GHz)
 Linux 4.4.0, 64 bits

Datasets

(34)

Experiments: C++ code

C++ code at https://github.com/jermp/rdf_indexes

(35)

Experiments: our solutions

Overall, 2Tp offers the best space/time tradeoff.

(36)

Our selected trade-off configuration substantially outperforms the tested
 competitors in both space and time.

Experiments: overall comparison

(37)

Conclusions

The triple indexing problem with pattern matching

can be solved efficiently in both time and space regards.

Our solution — the permuted trie index — achieves substantial performance improvement

against the best previous solutions.

Paper available at

https://arxiv.org/abs/1904.07619

C++ code available at

https://github.com/jermp/rdf_indexes

Cross-compression

Permutation-elimination

(38)

Any questions?

Thanks for your attention,

time, patience!

Riferimenti

Documenti correlati

We  measured an 0.87 mm ATLASGAL flux density of 1.10 ±0.16 Jy, after making a correction for an estimated 17% line contamination based on the integrated (non-maser) line

A role of small heat shock protein B8 (HspB8) in the autophagic removal of misfolded proteins responsible for neurodegenerative diseases. Chaperone-assisted selective autophagy

Dallo studio realizzato dalla World Tourism Organization (2014), emerge il ruolo del turismo a livello internazionale. L’Europa continua a rafforzarsi in ambito turistico: è

A detailed analysis of the residues involved in H-bonds with water molecules pointed out that the residues in positions 1, 3 and 7 (Y, E and E in PepE and Y, E and K

X-ray analysis (D) revealed an aggressive destructive growth pattern toward knee region with periosteal new bone formation at the site of injection (red arrow), but did not show sign

Abbreviations: Ac, acetate; Ala, alanine; Asp, aspartate; Arg, arginine; CHESS, chemical shift selective saturation standard; Cho, choline; ChoCC, choline containing compounds;

Our results suggest that both FM and KETO can significantly reduce cytokine concentrations (TNF-a, CXCL8 and IFN-c) in an ex vivo model in dairy cows and that there is