... But in practice…

(1)

Tries

(2)

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

(3)

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

... But in practice…

• Search: random memory accesses

• Space: len + pointers + strings

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

(4)

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

systile szaielyite

CT

on a sample

2-level indexing

Disk

Internal

Memory

2 limitations:

• Sampling rate ≈ lengths of sampled strings

• Trade-off ≈ speed vs space (because of bucket size)

2 advantages:

• Search ≈ typically 1 I/O

• Space ≈ Front-coding over buckets

(5)

An old idea: Patricia Trie

1

2 2

0

4

5

6

7

2 3

y

1 s

z

stile

zyg

5

etic ial

ygy

aibelyite

czecin omo

[Morrison, J.ACM 1968]

(6)

A new search

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

2 2

0 y

s 1 z

s z

5 e

i

y

a c

o Search(P):

• Phase 1: tree navigation

• Phase 2: Compute LCP

• Phase 3: tree navigation

Three-phase search:

P = syzyyea

01

2 5

g < y

P’s position

Only 1 string is checked

Trie Space ≈ #strings, NOT their length

[Ferragina-Grossi, J.ACM 1999]

(7)

….Locality Preserving Front Coding….

PT

on all strings

2-level indexing

Disk

Internal Memory

A limitation is n < M Typically 1 I/O

What about n > M

(8)

The String B-tree

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

PT PT PT

PT PT PT PT PT PT

Search(P) PT

•O((p/B) log_B n) I/Os

•O(occ/B) I/Os It is dynamic...

1 string checked : O(p/B) O(log_B n) levels

+

Lexicographic position of P

[Ferragina-Grossi, J.ACM 1999]

Knuth, vol 3°, pag. 489: “elegant”

(9)

GA 1

5 6

3

6 4

0

4 5 6 7

2 3

AGA GCGC

AG GG

AG C

A G A G A

On Front-Coding…

AGAAGA 5 G3 C

0 GCGCAGA 6 G

4 GGA 6 GA

Knuth

In-order visit Path covering+

Front Coding

3

0

Compacted Trie

=

FC + tree structure What about other traversals ?

FC + ... is searchable

(10)

Why pre-order visit

In Front-coding the Lcp information is

encoded many times

GA 1

5 6

3

6 4

0

4 5 6 7

2 3

AGA GCGC

AG GG

AG C

A G A G A

3

0

AGAAGA 1 G3 C

4 GCGCAGA 1 G

3 GGA 1 GA

Coding Rear

(11)

Text Indexing

(12)

What do we mean by

“Indexing” ?

 Word-based indexes, here a notion of “word” must be devised !

» Inverted lists, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !

» Suffix Array, Suffix tree, String B-tree,...

(13)

Basic notation and facts

Occurrences of P in T = All suffixes of T having P as a prefix

SUF(T) = Sorted set of suffixes of T T = mississippi

mississippi

4,7

P = si

T[i,N]

iff P is a prefix of the i-th suffix of T (ie. T[i,N])

T ⁱ ^P

Pattern P occurs at position i of T

From substring search To prefix search

Reduction

(14)

The Suffix Tree

T# = mississippi#

2 4 6 8 10 12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# _pi#

2 1

s

i

ppi# ^ssip_pi#

3

si

ssip ppi# pi#

1

#

ppi# ^ss^ip_pi#

Label = <pos,len>

Space: #nodes Search pattern P

Maximal repeated substring = node

(15)

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.

P=si

T = mississippi#

# i#

ippi#

issippi#

ississippi#

mississippi#

pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

SUF(T)

Suffix Array

• SA: ^^{N log}2 N) bits

• Text T: N chars

 In practice, a total of 5N bytes SA

12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#

suffix pointer

5

Prop 2. Starting position is the lexicographic one of P.

(16)

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

SA

12 11 8 5 2 1 10 9 7 4 6 3

P = si

P is larger

2 accesses per step

(17)

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

SA

12 11 8 5 2 1 10 9 7 4 6 3

P = si

P is smaller

Suffix Array search

• O^(log2 N) binary-search steps

• Each step takes O(p) char cmp

 overall, O(p log₂ N) time

+

[Manber-Myers, ’90]

(18)

Locating the occurrences

T = mississippi

^#

4 7

SA

12 11 8 5 2 1 10 9 7 4 6 3

si#

occ=2

12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6

3 si$

Suffix Array search

• O (p + log₂ N + occ) time

where # <  ^{< $}

sissippi sippi

(19)

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Text mining

• How long is the common prefix between T[i,...] and T[j,...] ?

• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j

Lcp

0 1 1 4 0 0 1 0 2 1 3

# i#

ippi#

issippi#

ississippi#

mississippi pi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

Lcp(7,3) = 1

= min{2,1,3}

(20)

Text mining

• Does it exist a repeated substring of length ≥ L ?

• Maximal Lcp of a suffix is with its adjacent

• Search for Lcp[i] ≥ L

Lcp

0 1 1 4 0 0 1 0 2 1 3

# i#

ippi#

issippi#

ississippi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

(21)

Lcp

0 1 1 4 0 0 1 0 2 1 3

Text mining

Does exist a substring of length ≥ L occurring ≥ C times ?

• Exist ≥ C equal substrings of length ≥ L chars

• Exist ≥ C suffixes sharing a prefix of ≥ L chars

• These suffixes may be not contiguous, but...

• Their “block” has a common prefix of ≥ L chars

• Search for Lcp[i,i+C-2] whose entries are ≥ L

# i#

ippi#

issippi#

ississippi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

L = 1, C = 4

(22)

How to construct SA from T ?

# i#

ippi#

issippi#

ississippi#

ppi#

sippi#

sissippi#

ssippi#

ssissippi#

12 11 8 5 2 1 10 9 7 4 6 3

SA

Elegant but inefficient

Obvious inefficiencies:

• (n² log n) time in the worst-case

• (n log n) cache misses or I/O faults

Input: T = mississippi#

(23)

The skew algorithm



The key problem: Compare efficiently two suffixes



Brute-force = (n) time per cmp, (n

²

log n) total

In order to sort the suffixes of S 1. Divide the suffixes of S in two groups

 S^0,2= suffixes starting at positions 0 mod 3 or 2 mod 3

 S¹= suffixes starting at positions 1 mod 3

2a. Sort recursively S

^0,2

(they are 2n/3)

2b. Sort S

¹

: suffix(3i+1) = S[3i+1]  suff(3i+2) 3. Merge the sorted S

^0,2

with the sorted S

¹

T(n) = O(split) + T(2n/3) + O(|S¹|) + O(merge) = O(n)

(24)

Sort recursively S ^0,2

We turn this problem into the SA-construction of a shorter string of length (2/3)n.

S=AAT GTG AGA TGA $$$



RadixSort all triplets that start at positions 0,2 mod 3

 T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$}

 Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)



Assign lexicographic names (log n bits)

 A$$=1, ATG=2, GA$=3,…



Build s

^0,2

and encode it:

 ATG TGA GAT GA$ TGT GAG ATG A$$

 2 6 5 3 7 4 2 1

1 2 3 4 5 6 7 8 9 10 11 12 13

(25)

Sort recursively S ^0,2

 Given

S=AAT GTG AGA TGA $$$



We have built:



s

^0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$

 enc(s^0,2) = 2 6 5 3 7 4 2 1



It is SA

^0,2

= [12, 9, 2, 11, 6, 8, 5, 3]

1 2 3 4 5 6 7 8 9 10 11 12 13

A suffix of s^0,2

A suffix of enc(s^0,2)

SA(enc(s^0,2)) gives

SA

^0,2

Lex-order is preserved

(26)

Sort S ¹

We turn this problem into the sort of pairs

S=AAT GTG AGA TGA $$$

Key observation: suff(1) = <A, pos(2)> = <A,3>

suff(7) = <A, pos(8)> = <A,6>

SA

^0,2

= [12, 9, 2, 11, 6, 8, 5, 3]

Suffix of S¹

1 2 3 4 5 6 7 8 9 10 11 12 13

 SA

¹

= [1, 7, 4, 10]

(27)

The merge step

To merge suffix s

_i

in S

^0,2

with suffix s

_k

in S

¹

, note that

 If (i mod 3) = 2  s_i+1and s_k+1 belong to S^0,2

 If (i mod 3) = 0  s_i+2and s_k+2 belong to S^0,2

their order can be derived from SA

^0,2

in O(1) time

SA

¹

SA

^0,2

SA

T(n) = T(2n/3) + O(n) + O(merge) = O(n)

S=AAT GTG AGA TGA $$$

1 2 3 4 5 6 7 8 9 10 11 12 13

... But in practice…

Tries

(Compacted) Trie

Performance:

(Compacted) Trie

... But in practice…

Performance:

CT

2-level indexing

Disk

2 limitations:

2 advantages:

An old idea: Patricia Trie

stile

A new search

PT

2-level indexing

Disk

A limitation is n < M Typically 1 I/O

What about n > M

The String B-tree

+

On Front-Coding…

Knuth

Front Coding

=

Why pre-order visit

Coding Rear

Text Indexing

What do we mean by

“Indexing” ?

Basic notation and facts

4,7

T i P

The Suffix Tree

T# = mississippi#

0

4

1

2 1

3

1

The Suffix Array

T = mississippi#

T = mississippi#

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#

+

Locating the occurrences

T = mississippi

where # <  < $

Text mining

Lcp(7,3) = 1

= min{2,1,3}

Text mining

Text mining

L = 1, C = 4

How to construct SA from T ?

Input: T = mississippi#

The skew algorithm

The key problem: Compare efficiently two suffixes

Brute-force = (n) time per cmp, (n

log n) total

In order to sort the suffixes of S 1. Divide the suffixes of S in two groups

2a. Sort recursively S

(they are 2n/3)

2b. Sort S

: suffix(3i+1) = S[3i+1]  suff(3i+2) 3. Merge the sorted S

with the sorted S

Sort recursively S 0,2

We turn this problem into the SA-construction of a shorter string of length (2/3)n.

S=AAT GTG AGA TGA $$$

RadixSort all triplets that start at positions 0,2 mod 3

Assign lexicographic names (log n bits)

Build s

and encode it:

T ⁱ ^P

where # <  ^{< $}

Sort recursively S ^0,2

Sort recursively S ^0,2

Sort S ¹