Tries
(Compacted) Trie
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
(2; 3,5)
Performance:
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
(Compacted) Trie
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
(2; 3,5)
... But in practice…
• Search: random memory accesses
• Space: len + pointers + strings
Performance:
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….
systile szaielyite
CT
on a sample
2-level indexing
Disk
Internal
Memory
2 limitations:
• Sampling rate ≈ lengths of sampled strings
• Trade-off ≈ speed vs space (because of bucket size)
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
An old idea: Patricia Trie
1
2 2
0
4
5
6
7
2 3
y
1 s
z
stile
zyg5
etic ial
ygy
aibelyite
czecin omo
[Morrison, J.ACM 1968]
A new search
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
2 2
0 y
s 1 z
s z
5 e
i
y
a c
o Search(P):
• Phase 1: tree navigation
• Phase 2: Compute LCP
• Phase 3: tree navigation
Three-phase search:
P = syzyyea
01
2 5
g < y
P’s position
Only 1 string is checked
Trie Space ≈ #strings, NOT their length
[Ferragina-Grossi, J.ACM 1999]
….Locality Preserving Front Coding….
PT
on all strings
2-level indexing
Disk
Internal Memory
A limitation is n < M Typically 1 I/O
What about n > M
The String B-tree
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23
PT PT PT
PT PT PT PT PT PT
Search(P) PT
•O((p/B) logB n) I/Os
•O(occ/B) I/Os It is dynamic...
1 string checked : O(p/B) O(logB n) levels
+
Lexicographic position of P
[Ferragina-Grossi, J.ACM 1999]
Knuth, vol 3°, pag. 489: “elegant”
GA 1
5 6
3
6 4
0
4 5 6 7
2 3
AGA GCGC
AG GG
AG C
A G A G A
On Front-Coding…
AGAAGA 5 G3 C
0 GCGCAGA 6 G
4 GGA 6 GA
Knuth
In-order visit Path covering+
Front Coding
3
0
Compacted Trie
=
FC + tree structure What about other traversals ?
FC + ... is searchable
Why pre-order visit
In Front-coding the Lcp information is
encoded many times
GA 1
5 6
3
6 4
0
4 5 6 7
2 3
AGA GCGC
AG GG
AG C
A G A G A
3
0
AGAAGA 1 G3 C
4 GCGCAGA 1 G
3 GGA 1 GA
Coding Rear
Text Indexing
What do we mean by
“Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted lists, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
Basic notation and facts
Occurrences of P in T = All suffixes of T having P as a prefix
SUF(T) = Sorted set of suffixes of T T = mississippi
mississippi
4,7
P = si
T[i,N]
iff P is a prefix of the i-th suffix of T (ie. T[i,N])
T i P
Pattern P occurs at position i of T
From substring search To prefix search
Reduction
The Suffix Tree
T# = mississippi#
2 4 6 8 10 12
11 8
5 2 1 10 9
7 4
6 3
0
4
# i
ppi#
ssi
mississippi#
1
p
i# pi#
2 1
s
i
ppi# ssippi#
3
si
ssip ppi# pi#
1
#
ppi# ssippi#
Label = <pos,len>
Space: #nodes Search pattern P
Maximal repeated substring = node
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
P=si
T = mississippi#
# i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
SUF(T)
Suffix Array
• SA: N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes SA
12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
suffix pointer
5
Prop 2. Starting position is the lexicographic one of P.
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
T = mississippi#
SA
12 11 8 5 2 1 10 9 7 4 6 3
P = si
P is larger
2 accesses per step
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
T = mississippi#
SA
12 11 8 5 2 1 10 9 7 4 6 3
P = si
P is smaller
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
+
[Manber-Myers, ’90]Locating the occurrences
T = mississippi
#4 7
SA
12 11 8 5 2 1 10 9 7 4 6 3
si#
occ=2
12 11 8 5 2 1 10 9 7 4 6 3 12 11 8 5 2 1 10 9 7 4 6
3 si$
Suffix Array search
• O (p + log2 N + occ) time
where # < < $
sissippi sippi
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j
Lcp
0 1 1 4 0 0 1 0 2 1 3
# i#
ippi#
issippi#
ississippi#
mississippi pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
12 11 8 5 2 1 10 9 7 4 6 3
SA
Lcp(7,3) = 1
= min{2,1,3}
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
• Does it exist a repeated substring of length ≥ L ?
• Maximal Lcp of a suffix is with its adjacent
• Search for Lcp[i] ≥ L
Lcp
0 1 1 4 0 0 1 0 2 1 3
# i#
ippi#
issippi#
ississippi#
mississippi pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
12 11 8 5 2 1 10 9 7 4 6 3
SA
Lcp
0 1 1 4 0 0 1 0 2 1 3
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
Does exist a substring of length ≥ L occurring ≥ C times ?
• Exist ≥ C equal substrings of length ≥ L chars
• Exist ≥ C suffixes sharing a prefix of ≥ L chars
• These suffixes may be not contiguous, but...
• Their “block” has a common prefix of ≥ L chars
• Search for Lcp[i,i+C-2] whose entries are ≥ L
# i#
ippi#
issippi#
ississippi#
mississippi pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
12 11 8 5 2 1 10 9 7 4 6 3
SA
L = 1, C = 4
How to construct SA from T ?
# i#
ippi#
issippi#
ississippi#
mississippi pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
12 11 8 5 2 1 10 9 7 4 6 3
SA
Elegant but inefficient
Obvious inefficiencies:
• (n2 log n) time in the worst-case
• (n log n) cache misses or I/O faults
Input: T = mississippi#
The skew algorithm
The key problem: Compare efficiently two suffixes
Brute-force = (n) time per cmp, (n
2log n) total
In order to sort the suffixes of S 1. Divide the suffixes of S in two groups
S0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3
S1 = suffixes starting at positions 1 mod 3
2a. Sort recursively S
0,2(they are 2n/3)
2b. Sort S
1: suffix(3i+1) = S[3i+1] suff(3i+2) 3. Merge the sorted S
0,2with the sorted S
1T(n) = O(split) + T(2n/3) + O(|S1|) + O(merge) = O(n)
Sort recursively S 0,2
We turn this problem into the SA-construction of a shorter string of length (2/3)n.
S=AAT GTG AGA TGA $$$
RadixSort all triplets that start at positions 0,2 mod 3
T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$}
Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)
Assign lexicographic names (log n bits)
A$$=1, ATG=2, GA$=3,…
Build s
0,2and encode it:
ATG TGA GAT GA$ TGT GAG ATG A$$
2 6 5 3 7 4 2 1
1 2 3 4 5 6 7 8 9 10 11 12 13
Sort recursively S 0,2
Given
S=AAT GTG AGA TGA $$$
We have built:
s
0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$ enc(s0,2) = 2 6 5 3 7 4 2 1
It is SA
0,2= [12, 9, 2, 11, 6, 8, 5, 3]
1 2 3 4 5 6 7 8 9 10 11 12 13
A suffix of s0,2
A suffix of enc(s0,2)
SA(enc(s0,2)) gives
SA
0,2Lex-order is preserved
Sort S 1
We turn this problem into the sort of pairs
S=AAT GTG AGA TGA $$$
Key observation: suff(1) = <A, pos(2)> = <A,3>
suff(7) = <A, pos(8)> = <A,6>
SA
0,2= [12, 9, 2, 11, 6, 8, 5, 3]
Suffix of S1
1 2 3 4 5 6 7 8 9 10 11 12 13
SA
1= [1, 7, 4, 10]
The merge step
To merge suffix s
iin S
0,2with suffix s
kin S
1, note that
If (i mod 3) = 2 si+1 and sk+1 belong to S0,2
If (i mod 3) = 0 si+2 and sk+2 belong to S0,2
their order can be derived from SA
0,2in O(1) time
SA
1SA
0,2SA
T(n) = T(2n/3) + O(n) + O(merge) = O(n)
S=AAT GTG AGA TGA $$$
1 2 3 4 5 6 7 8 9 10 11 12 13