Two equivalent problems

(1)

Two equivalent problems

 RMQ on integer arrays  RMQ on ±1 integer arrays

 LCA on a Cartesian Tree

 Euler tour of the Cartesian tree

 RMQ on the array of node-levels

 LCA on general trees  RMQ on ±1 integer arrays

 Euler tour of the Cartesian tree

 RMQ on the array of node-levels

(2)

Search for k-mismatches

Problem: Find longest match between P[i,…] and T[j,…]

CCGTACGATCAGTACAGTACAGTACTTTTTTAAACCGGAGACTACA

T

CCGAACTATC

P

Data Structure

1. Concatenate P and T into a string X = T$P

2. Construct a data structure on X that retrieves FAST the longest match between any pair of suffixes of X

LCA or LCP query

If O(1)  O(k) time

(3)

Suffix Tree & LCA

LCA

T#P = mississippi#si^$

1 2 4 6 8 10 13

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# _pi#

2 1

s

i

ppi# ^ssip pi#

3

si

ssip ppi# pi#

1

# 14

$

ppi# ^ss^ip_pi#

13

$

Longest match(3,13)

(4)

SA

12 11 14 8 5 2 1 10 9 13 7 4 6 3

Lcp

0 1 1 1 4 0 0 1 0 2 2 1 3

Suffix Array & LCP  RMQ

T#P = mississippi#si

1 2 4 6 8 10 13

RMQ

^si_sippi#

sissippi#

ssippi#

ssissippi#

LCP

Longest match(3,13)

Surprisingly, also LCA  ^RMQ

(5)

The RMQ problem

RMQ_A(i,j) – returns the index of the smallest element in the subarray A[i..j].

10 2 12 1 12 13 21 15 14 10

RMQ(2,7) = 3

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

 Trivial solution: Precompute RMQ for every pair of indices.

This takes Q(n²) space, and O(1) query time

(6)

RMQ on a general array

10 25 22 34 7 19 9 12 26 16

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

4 0

2

1 3

5 7

6

9 8

Cartesian Tree

(7)

RMQ(i,j) = LCA(i,j) on Cartesian trees

10 25 22 34 7 19 9 12 26 16

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

4 0

2

1 3

5 7

6

9 8

(8)

Generic RMQ  LCA  RMQ

±1

LCA(u,v) = shallowest node between u and v during a depth first search traversal of T.

3

4 1 5

6 2 7

Euler tour3 2 3 1 4 1 7 1 3 5 6 5 3

1 2

3

4

6 9

10 8

5 7 11

12 0

LCA

_T

(4,6) = 3

Node at the lowest level

1 2 1 2 3 2 3 2 1 2 3 2 1

±1 array

Node Level

(9)

We are left with “ RMQ on ±1 array ”

RMQ_A(i,j) – returns the index of the smallest element in the subarray A[i..j].

10 11 12 11 12 13 14 15 14 13

RMQ(2,7) = 3

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

 Recall the trivial solution: Precompute RMQ for every pair of indices. This takes Q(n²) space, and O(1) query time

(10)

Sparse Table

 Preprocess sub arrays of len 2^k, for every k=0,1,…, log n

 M(i,j) = index of min value in A[i, i+ 2^j -1]

10 11 12 11 12 13 14 15 14 13

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

M(2,0)=2 M(2,1)=3

M(2,2)=3

M(2,3)=3

Total space is O(n log n) RMQ query ?

(11)

Querying the Sparse Table

log( ) k    j i   

a₁ ... ...

i j

2^k ^elements 2^k ^elements

Total space is O(n log n) RMQ query takes O(1) time

(12)

Bucketing



A’[i] = min in the i-th block of A.

A

... ... ... ... ...

blocks n

log 2n

l o g 2

n

A’[0] A’[i] A’[2n/logn]

… ...

B[0] B[i] B[2n/logn]



B’[i] is the position (index) of that min.

A’

(13)

Use the Bucketing

... ... ... ... ...

A’[0] A’[i] A’[2n/logn]

lo g 2 n



Preprocess A’ for RMQ using SparseTable

 Space is (2n/log n) * log (2n/log n) = O(n)

 RMQ queries on A’ take O(1) time



Preprocess every block of A’ for border RMQ

 Space is O(n), border RMQ take O(1) time.

RMQ(i,j) takes O(1) time, if i,j lie in distinct blocks

A’

(14)

In-block RMQ over ±1 arrays

There are normalized blocks Set

Table[BlockEnc, i, j]

= RMQ(i,j)

, _X ( , ) _Y ( , ) i j RMQ i j RMQ i j

 

3 4 5 6 5 4 5 6 5 4

0 1 2 3 2 1 2 3 2 1 +1 +1 +1 -1 -1 +1 +1 -1 -1

( )

O n

log 1

2 2 ( )

n

O n

  

 

  

DX = DY



^log²



^{( )}

O n n  O n

X Y

Table entries

(15)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# _pi#

2 1

s

i

ppi# ^ssip pi#

3

si

ssip ppi# pi#

1

#

ppi# ^ss^ip_pi#

(16)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# _pi#

2 1

s

i

ppi# ^ssip_pi#

si 3

ssip ppi# pi#

1

#

ppi# ^ss^ip_pi#

<ssip>

1. Longest repeated prefix of T[6,...]

2. Repeat is on the left of 6

It is on the path to 6

Leftmost occ

= 3 < 6

Leftmost occ

= 3 < 6

By maximality check only nodes

(17)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# _pi#

2 1

s

i

ppi# ^ssip pi#

3

si

ssip ppi# pi#

1

#

ppi# ^ss^ip_pi#

2

2 9

3

4

3

min-leaf

 Leftmost copy

Parsing:

1. Scan T

2. Visit ST and stop when min-leaf ≥ current pos

Precompute the min descending leaf at every node in O(n) time.

Two equivalent problems