Auto-completion Search
How it works
What’s the dictionary ?
Trie for the Dictionary
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
Pro: O(p) search time = path scan
Cons: edge + node labels + tree structure
Top-1
What’s the ranking/scoring of the answers ?
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
P = sy
8
2 1 4
8,1
Top-1: How to speed-up
How to compute the top-1 in O(1) time ?
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
P = sy
8
2 1 4
4
1 7
2
3
5
8,1
1Top-2
How to compute the top-2 in O(1) time ?
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
P = sy
8
2 1 4
4,2
1,4 7,6
2
3
5 1,7
Top-k in O(1) time, but k× space
Top-k: How to squeeze ?
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
P = sy
8
2 1 4
2
3
5
8 1
2 2
1 3
4 4
2 5
3 6
5 7 Score
String
Prefixed by P, proceed D&C
Top-k: How to squeeze ?
8 1
2 2
1 3
4 4
2 5
3 6
5 7 Score
String
Prefixed by P, proceed D&C
Let H be a max-heap of size k, keep also
min[H] and max[H]Initialize H with k pairs <-, NULL>
Given the range <L,R> (here <1,4>)
Compute max-score in Array[L,R] (pos. M, value m)
If m ≤ min[H], skip;
else:
Insert <m,string> in H;
If size(H)>k then remove min[H];
Recurse on <L,M-1> and <M+1,R>, if not empty.
Time: O(k) time, and space
L R RMQ-query
in O(1) time and O(n) space
Example for Top-2
4 1
2 2
1 3
8 4
2 5
3 6
5 7 Score
String
Consider this other array
Range : operations
[1,7]: H <8,4>; recurse on [1,3] and [5,7]
[1,3]: H={<8,4>} <4,1>; recurse on [1,0] and [2,3]
[5,7]: H={<8,4>,<4,1>} <5,7>; delete <4,1> from H, recurse on [5,6] and [8,7]
[2,3]: H={<8,4>,<5,7>} <2,2>; since min[H]=5, not insert in H [5,6]: H ={<8,4>,<5,7>} <3,6>; since min[H]=5, not insert in H
H = {<8,4> e <5,7>}
L R
A smarter approach
8 1
2 2
1 3
4 4
2 5
3 6
5 7 Score
String
Prefixed by P, proceed D&C
Let H be a max-heap, including items <val, string, [low,high]>
Compute max-score in Array[L,R] (pos. M, value m) i=0; insert <m, string[M], L, R> in H
While (i<k) do
Extract <x, string[X], Lx, Rx> from H, where x is max-value in H
Return String[X] as one of the top-k strings
Compute max-score in Array[Lx,X-1] (pos. M’, value m’)
insert <m’, string[M’], Lx, X-1>
Compute max-score in Array[X+1,Rx] (pos. M’’, value m’’)
insert <m’’, string[M’’], X+1, Rx>
i++;
Time: still O(k) time, and space
L R
Random access to postings lists
and other data types
A basic problem !
AbacoBattleCarColdCod ....
D
1 12 15 20 22....
Dog
Array of n skip pointers to an array of m integers
•
(log m) bits per pointer = (n log m) bits = 32 n bits.
•
it is effective for few pointers
100001000001001000100 ....
B
Array of n string pointers to strings of total length m
•
(n log m) bits = 32 n bits.
•
it is independent of string length
We aim at achieving ≈ n log(m/n) bits < n log m
Rank/Select
00101001010101011111110000011010101....
B
•
Rank
b(i) = number of b in B[1,i]
•
Select
b(i) = position of the i-th b in B
Rank1(6) = 2 Select1(3) = 8
m = |B|
n = #1
Do exist data structures that solve this problem in
O(1) query time and very small
extra
space (i.e. +o(m) bits)Wish to index the bit vector B (possibly
compressed).
The Bit-Vector Index: |B| + o(|
B|)
m = |B|
n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
00101001010101011 1111100010110101 0101010111000....
B
Z
8 18(absolute) Rank1
Setting Z = poly(log m) and z=(1/2) log m:
Extra space is + (m/Z) log m + (m/z) log Z + o(m)
+ O(m loglog m / log m) = o(m) bits
Rank time is O(1)
Term o(m) is crucial in practice, B is untouched (not compressed)
0000 1 0 .... ... ...
1011 2 1 ....
block pos #1
z
(bucket-relative) Rank1
4 5 8
Rank
There exists a Bit-Vector Index taking o(m) extra bits
and constant time for Rank/Select.
B is needed and read-only!
z = 3, w=2
Elias-Fano (B is not needed)
If w = log (m/n) and z = log n, where m = |B| and n = #1 then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits
0 1 2 3 4 5 6 7
In unary
Rank1(i) on B Needs binary search over B
Select1(i) on B uses L and (Select1(H,i) – i) in +o(n) space
(
Select
1on H)
If you wish to play with Rank and Select
m/10 + n log (m/n)
Rank in 0.4 sec, Select in < 1 sec vs 32n bits of explicit pointers