• Non ci sono risultati.

Two equivalent problems

N/A
N/A
Protected

Academic year: 2021

Condividi "Two equivalent problems"

Copied!
17
0
0

Testo completo

(1)

Two equivalent problems

 RMQ on integer arrays  RMQ on ±1 integer arrays

 LCA on a Cartesian Tree

 Euler tour of the Cartesian tree

 RMQ on the array of node-levels

 LCA on general trees  RMQ on ±1 integer arrays

 Euler tour of the Cartesian tree

 RMQ on the array of node-levels

(2)

Search for k-mismatches

Problem: Find longest match between P[i,…] and T[j,…]

CCGTACGATCAGTACAGTACAGTACTTTTTTAAACCGGAGACTACA

T

CCGAACTATC

P

Data Structure

1. Concatenate P and T into a string X = T$P

2. Construct a data structure on X that retrieves FAST the longest match between any pair of suffixes of X

LCA or LCP query

If O(1)  O(k) time

(3)

Suffix Tree & LCA

LCA

T#P = mississippi#si$

1 2 4 6 8 10 13

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# pi#

2 1

s

i

ppi# ssip pi#

3

si

ssip ppi# pi#

1

# 14

$

ppi# ssippi#

13

$

Longest match(3,13)

(4)

SA

12 11 14 8 5 2 1 10 9 13 7 4 6 3

Lcp

0 1 1 1 4 0 0 1 0 2 2 1 3

Suffix Array & LCP  RMQ

T#P = mississippi#si

1 2 4 6 8 10 13

RMQ

sisippi#

sissippi#

ssippi#

ssissippi#

LCP

LCP

Longest match(3,13)

Surprisingly, also LCA RMQ

(5)

The RMQ problem

RMQA(i,j) – returns the index of the smallest element in the subarray A[i..j].

10 2 12 1 12 13 21 15 14 10

RMQ(2,7) = 3

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

Trivial solution: Precompute RMQ for every pair of indices.

This takes Q(n2) space, and O(1) query time

(6)

RMQ on a general array

10 25 22 34 7 19 9 12 26 16

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

4 0

2

1 3

5 7

6

9 8

Cartesian Tree

(7)

RMQ(i,j) = LCA(i,j) on Cartesian trees

10 25 22 34 7 19 9 12 26 16

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

4 0

2

1 3

5 7

6

9 8

(8)

Generic RMQ  LCA  RMQ

±1

LCA(u,v) = shallowest node between u and v during a depth first search traversal of T.

3

4 1 5

6 2 7

Euler tour3 2 3 1 4 1 7 1 3 5 6 5 3

1 2

3

4

6 9

10 8

5 7 11

12 0

LCA

T

(4,6) = 3

Node at the lowest level

1 2 1 2 3 2 3 2 1 2 3 2 1

±1 array

Node Level

(9)

We are left with “ RMQ on ±1 array ”

RMQA(i,j) – returns the index of the smallest element in the subarray A[i..j].

10 11 12 11 12 13 14 15 14 13

RMQ(2,7) = 3

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

Recall the trivial solution: Precompute RMQ for every pair of indices. This takes Q(n2) space, and O(1) query time

(10)

Sparse Table

Preprocess sub arrays of len 2k, for every k=0,1,…, log n

M(i,j) = index of min value in A[i, i+ 2j -1]

10 11 12 11 12 13 14 15 14 13

A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]

M(2,0)=2 M(2,1)=3

M(2,2)=3

M(2,3)=3

Total space is O(n log n) RMQ query ?

(11)

Querying the Sparse Table

log( ) k    j i   

a1 ... ...

i j

2k elements 2k elements

Total space is O(n log n) RMQ query takes O(1) time

(12)

Bucketing

A’[i] = min in the i-th block of A.

A

... ... ... ... ...

blocks n

log 2n

l o g 2

n

A’[0] A’[i] A’[2n/logn]

… ...

B[0] B[i] B[2n/logn]

B’[i] is the position (index) of that min.

A’

(13)

Use the Bucketing

... ... ... ... ...

A’[0] A’[i] A’[2n/logn]

lo g 2 n

Preprocess A’ for RMQ using SparseTable

Space is (2n/log n) * log (2n/log n) = O(n)

RMQ queries on A’ take O(1) time

Preprocess every block of A’ for border RMQ

Space is O(n), border RMQ take O(1) time.

RMQ(i,j) takes O(1) time, if i,j lie in distinct blocks

A’

(14)

In-block RMQ over ±1 arrays

There are normalized blocks Set

Table[BlockEnc, i, j]

= RMQ(i,j)

, X ( , ) Y ( , ) i j RMQ i j RMQ i j

 

3 4 5 6 5 4 5 6 5 4

0 1 2 3 2 1 2 3 2 1 +1 +1 +1 -1 -1 +1 +1 -1 -1

( )

O n

log 1

2 2 ( )

n

O n

DX = DY

log2

( )

O n n O n

X Y

Table entries

(15)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# pi#

2 1

s

i

ppi# ssip pi#

3

si

ssip ppi# pi#

1

#

ppi# ssippi#

<m><i><s><si><ssip><pi>

(16)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# pi#

2 1

s

i

ppi# ssippi#

si 3

ssip ppi# pi#

1

#

ppi# ssippi#

<ssip>

1. Longest repeated prefix of T[6,...]

2. Repeat is on the left of 6

It is on the path to 6

Leftmost occ

= 3 < 6

Leftmost occ

= 3 < 6

By maximality check only nodes

(17)

LZ-parsing (gzip)

T = mississippi#

1 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

# i

ppi#

ssi

mississippi#

1

p

i# pi#

2 1

s

i

ppi# ssip pi#

3

si

ssip ppi# pi#

1

#

ppi# ssippi#

<m><i><s><si><ssip><pi>

2

2 9

3

4

3

min-leaf

Leftmost copy

Parsing:

1. Scan T

2. Visit ST and stop when min-leaf ≥ current pos

Precompute the min descending leaf at every node in O(n) time.

Riferimenti

Documenti correlati

A computational implementation of the length formula yields the j-multiplicity with high probability, however it is quite slow.. In this talk, we will give a formula for

(a) Average LOS cumulative displacement of the area in accelerating creep captured by the satellite InSAR during the entire monitoring period, and (b) corresponding velocity plot up

Useful results for statistical applications The first convergence result provided in this section can be used for the construction of asymptotic confidence intervals for the

A LLEN , D.C., “John Donne’s Knowledge of Renaissance Medicine” in Essential Articles for the Study of John Donne’s Poetry, J.R.. A NDREASEN , N.J.C., John

View of the crystal packing of: (A) BE_I and (B) BE_IV (S enantiomer in green, R enantiomer in pink). For sake of clarity, only the set of coordinates having the highest

Expression levels of the synthetic plyGBS gene in chloroplasts were so high (more than 70% TSP, the highest foreign protein accumulation level ever obtained in a transgenic plant)

Ancora più complesso è il sistema delle impugnazioni. Qui il problema da risolvere deriva dalla circostanza che gli atti della Procura europea sono per definizione atti di un

Generalized log odds ratio analysis for the association in two-way contingency table PAGE 5.. 2 Notations and odds