• Non ci sono risultati.

... But in practice…

N/A
N/A
Protected

Academic year: 2021

Condividi "... But in practice…"

Copied!
27
0
0

Testo completo

(1)

Tries

(2)

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

(3)

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

... But in practice…

• Search: random memory accesses

• Space: len + pointers + strings

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

(4)

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

systile szaielyite

CT

on a sample

2-level indexing

Disk

Internal

Memory

2 limitations:

• Sampling rate ≈ lengths of sampled strings

• Trade-off ≈ speed vs space (because of bucket size)

2 advantages:

• Search ≈ typically 1 I/O

• Space ≈ Front-coding over buckets

(5)

An old idea: Patricia Trie

1

2 2

0

4

5

6

7

2 3

y

1 s

z

stile

zyg

5

etic ial

ygy

aibelyite

czecin omo

[Morrison, J.ACM 1968]

(6)

A new search

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

2 2

0 y

s 1 z

s z

5 e

i

y

a c

o Search(P):

• Phase 1: tree navigation

• Phase 2: Compute LCP

• Phase 3: tree navigation

Three-phase search:

P = syzyyea

01

2 5

g < y

P’s position

Only 1 string is checked

Trie Space ≈ #strings, NOT their length

[Ferragina-Grossi, J.ACM 1999]

(7)

….Locality Preserving Front Coding….

PT

on all strings

2-level indexing

Disk

Internal Memory

A limitation is n < M Typically 1 I/O

What about n > M

(8)

The String B-tree

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

PT PT PT

PT PT PT PT PT PT

Search(P) PT

•O((p/B) logB n) I/Os

•O(occ/B) I/Os It is dynamic...

1 string checked : O(p/B) O(logB n) levels

+

Lexicographic position of P

[Ferragina-Grossi, J.ACM 1999]

Knuth, vol 3°, pag. 489: “elegant”

(9)

GA 1

5 6

3

6 4

0

4 5 6 7

2 3

AGA GCGC

AG GG

AG C

A G A G A

On Front-Coding…

AGAAGA 5 G3 C

0 GCGCAGA 6 G

4 GGA 6 GA

Knuth

In-order visit Path covering+

Front Coding

3

0

Compacted Trie

=

FC + tree structure What about other traversals ?

FC + ... is searchable

(10)

Why pre-order visit

In Front-coding the Lcp information is

encoded many times

GA 1

5 6

3

6 4

0

4 5 6 7

2 3

AGA GCGC

AG GG

AG C

A G A G A

3

0

AGAAGA 1 G3 C

4 GCGCAGA 1 G

3 GGA 1 GA

Coding Rear

(11)

Text Indexing

(12)

What do we mean by

“Indexing” ?

 Word-based indexes, here a notion of “word” must be devised !

» Inverted lists, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !

» Suffix Array, Suffix tree, String B-tree,...

(13)

Basic notation and facts

Occurrences of P in T = All suffixes of T having P as a prefix

SUF(T) = Sorted set of suffixes of T T = mississippi

mississippi

4,7

P = si

T[i,N]

iff P is a prefix of the i-th suffix of T (ie. T[i,N])

T i P

Pattern P occurs at position i of T

From substring search To prefix search

Reduction

(14)

The Suffix Tree

T# = mississippi#

2 4 6 8 10 12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# pi#

2 1

s

i

ppi# ssippi#

3

si

ssip ppi# pi#

1

#

ppi# ssippi#

Label = <pos,len>

Space: #nodes Search pattern P

Maximal repeated substring = node

(15)

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.

P=si

T = mississippi#

# i#

ippi#

issippi#

ississippi#

mississippi#

pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

SUF(T)

Suffix Array

• SA: N log2 N) bits

• Text T: N chars

 In practice, a total of 5N bytes SA

12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#

suffix pointer

5

Prop 2. Starting position is the lexicographic one of P.

(16)

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

SA

12 11 8 5 2 1 10 9 7 4 6 3

P = si

P is larger

2 accesses per step

(17)

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

SA

12 11 8 5 2 1 10 9 7 4 6 3

P = si

P is smaller

Suffix Array search

• O(log2 N) binary-search steps

Each step takes O(p) char cmp

overall, O(p log2 N) time

+

[Manber-Myers, ’90]

(18)

Locating the occurrences

T = mississippi

#

4 7

SA

12 11 8 5 2 1 10 9 7 4 6 3

si#

occ=2

12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6

3 si$

Suffix Array search

• O (p + log2 N + occ) time

where # <  < $

sissippi sippi

(19)

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Text mining

How long is the common prefix between T[i,...] and T[j,...] ?

• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j

Lcp

0 1 1 4 0 0 1 0 2 1 3

# i#

ippi#

issippi#

ississippi#

mississippi pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

Lcp(7,3) = 1

= min{2,1,3}

(20)

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Text mining

Does it exist a repeated substring of length ≥ L ?

• Maximal Lcp of a suffix is with its adjacent

• Search for Lcp[i] ≥ L

Lcp

0 1 1 4 0 0 1 0 2 1 3

# i#

ippi#

issippi#

ississippi#

mississippi pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

(21)

Lcp

0 1 1 4 0 0 1 0 2 1 3

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Text mining

Does exist a substring of length ≥ L occurring ≥ C times ?

• Exist ≥ C equal substrings of length ≥ L chars

• Exist ≥ C suffixes sharing a prefix of ≥ L chars

• These suffixes may be not contiguous, but...

• Their “block” has a common prefix of ≥ L chars

• Search for Lcp[i,i+C-2] whose entries are ≥ L

# i#

ippi#

issippi#

ississippi#

mississippi pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

L = 1, C = 4

(22)

How to construct SA from T ?

# i#

ippi#

issippi#

ississippi#

mississippi pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

Elegant but inefficient

Obvious inefficiencies:

(n2 log n) time in the worst-case

(n log n) cache misses or I/O faults

Input: T = mississippi#

(23)

The skew algorithm

The key problem: Compare efficiently two suffixes

Brute-force = (n) time per cmp, (n

2

log n) total

In order to sort the suffixes of S 1. Divide the suffixes of S in two groups

S0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3

S1 = suffixes starting at positions 1 mod 3

2a. Sort recursively S

0,2

(they are 2n/3)

2b. Sort S

1

: suffix(3i+1) = S[3i+1]  suff(3i+2) 3. Merge the sorted S

0,2

with the sorted S

1

T(n) = O(split) + T(2n/3) + O(|S1|) + O(merge) = O(n)

(24)

Sort recursively S 0,2

We turn this problem into the SA-construction of a shorter string of length (2/3)n.

S=AAT GTG AGA TGA $$$

RadixSort all triplets that start at positions 0,2 mod 3

T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$}

Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)

Assign lexicographic names (log n bits)

A$$=1, ATG=2, GA$=3,…

Build s

0,2

and encode it:

ATG TGA GAT GA$ TGT GAG ATG A$$

2 6 5 3 7 4 2 1

1 2 3 4 5 6 7 8 9 10 11 12 13

(25)

Sort recursively S 0,2

 Given

S=AAT GTG AGA TGA $$$

We have built:

s

0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$

enc(s0,2) = 2 6 5 3 7 4 2 1

It is SA

0,2

= [12, 9, 2, 11, 6, 8, 5, 3]

1 2 3 4 5 6 7 8 9 10 11 12 13

A suffix of s0,2

A suffix of enc(s0,2)

SA(enc(s0,2)) gives

SA

0,2

Lex-order is preserved

(26)

Sort S 1

We turn this problem into the sort of pairs

S=AAT GTG AGA TGA $$$

Key observation: suff(1) = <A, pos(2)> = <A,3>

suff(7) = <A, pos(8)> = <A,6>

SA

0,2

= [12, 9, 2, 11, 6, 8, 5, 3]

Suffix of S1

1 2 3 4 5 6 7 8 9 10 11 12 13

 SA

1

= [1, 7, 4, 10]

(27)

The merge step

To merge suffix s

i

in S

0,2

with suffix s

k

in S

1

, note that

If (i mod 3) = 2  si+1 and sk+1 belong to S0,2

If (i mod 3) = 0  si+2 and sk+2 belong to S0,2

their order can be derived from SA

0,2

in O(1) time

SA

1

SA

0,2

SA

T(n) = T(2n/3) + O(n) + O(merge) = O(n)

S=AAT GTG AGA TGA $$$

1 2 3 4 5 6 7 8 9 10 11 12 13

Riferimenti

Documenti correlati

Importantly, these two studies explore the ability of different kinds of Tree Kernel to perform the tasks of classifying argumentative stances (the first study employs PTK [16],

Gusfield also provides a characterization on suffix tree of supermaximal repeats. In particular he proves that supermaximal repeats label internal nodes v of T such that all

Upon startup a bridge sets all its link to PRE-FORWARDING and claims to be Root and designated bridge on every port When the information about root are expired the bridge tries

&#34; Nearly optimal pattern search cost = O(P/pB + lg B n) block transfers.. Space

Build the suffix tree of the string T

– Ad ogni passo viene aggiunto l'arco di peso minimo che collega un nodo già raggiunto dell'albero con uno non ancora raggiunto. Prim: Shortest connection networks and

• Algoritmo di [Baeza-Yates, Gonnet] con Suffix Array Algoritmo di [Baeza-Yates, Gonnet] con Suffix Array Analisi delle Prestazioni. Analisi

In this section we show that, if the BWT of a text T [1..n] has r runs, then we can have an index using O(r) space that not only efficiently finds the interval SA[sp..ep] of