• Non ci sono risultati.

Exact string search

N/A
N/A
Protected

Academic year: 2021

Condividi "Exact string search"

Copied!
31
0
0

Testo completo

(1)

Dictionary search

Exact string search

Paper on Cuckoo Hashing

(2)

Exact String Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.

Hashing

(3)

Hashing with chaining

(4)

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n * (1/m) = n/m

= (load factor)

(5)

Search cost

m = (n)

(6)

In practice

A trivial hash function is:

prime

(7)

A “provably good” hash is

 Each a i is selected at random in [0,m)

k 0 k 1 k 2 k r

≈log 2 m

r ≈ L / log 2 m

a 0 a 1 a 2 a r

K

a

prime

l = max string len m = table size

not necessarily: (...mod p) mod m

(8)

Cuckoo Hashing

A B C

E D

2 hash tables, and 2 random choices where an item can be

stored

(9)

A B C

E D

F

A running example

(10)

A B C F

E D

A running example

(11)

A B C F

E D

G

A running example

(12)

E G B C F

A D

A running example

(13)

Cuckoo Hashing Examples

A B C

E D F

G

Random (bipartite) graph: node=cell, edge=key

(14)

Natural Extensions

 More than 2 hashes (choices) per key.

 Very different: hypergraphs instead of graphs.

 Higher memory utilization

3 choices : 90+% in experiments

4 choices : about 97%

 2 hashes + bins of B-size.

 Balanced allocation and tightly O(1)-size bins

 Insertion sees a tree of possible evict+ins paths

but more insert time (and random access)

more memory

...but more local

(15)

Dictionary search

Making one-side errors

Paper on Bloom Filter

(16)

What does it mean to crawl ?

• What to crawl in the Web ?

• How deep to crawl a site ?

• How often to crawl a page ?

(17)

Crawling

How to keep track of the URLs visited by a crawler?

 URLs are long

 Check should be very fast

 No care about small errors (≈ page not crawled)

Bloom Filter

over crawled URLs

(18)

Searching with errors...

(19)
(20)
(21)

Problem: false positives

(22)

TTT 2

(23)

Not perfectly true but...

(24)

m/n = 8

Opt k = 5.45...

We do have an

explicit formula

for the optimal k

(25)
(26)
(27)

Dictionary search

Prefix-string search

Reading 3.1 and 5.2

(28)

Prefix-string Search

Given a dictionary D of K strings, of total length N,

store them in a way that we can efficiently support

prefix searches for a pattern P over them.

(29)

Trie: speeding-up searches

1

2 2

0

4

5

6

7

2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

Pro: O(p) search time

Cons: edge + node labels and tree structure

(30)

Front-coding: squeezing strings

http://checkmate.com/All_Natural/

http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html

http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html ...

0 http://checkmate.com/All_Natural/

33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html

25 Essence_Oils.html

25 Mineral_Bath_Crystals.html 38 Salt.html

33 Cream.html

33  45%

0 http://checkmate.com/All/Natural/Washcloth.html...

….systile syzygetic syzygial syzygy….

2 5 5

Gzip may be much better...

(31)

….70systile 92zygeti c85ial 65y 110szaibelyite 82czecin92omo….

systile szaielyite CT

on a sample

2-level indexing

Disk

Internal

Memory A disadvantage:

Trade-off ≈ speed vs space (because of bucket size)

2 advantages:

Search ≈ typically 1 I/O

Space ≈ Front-coding over buckets

Riferimenti

Documenti correlati

Definition (the preorder induced by a separable capitalization factor). Let ≥ fe the usual preorder of the financial events. We shall give two proofs: algebraic and geometrical. the

▪ Each server processes 2 x data => 2 x execution time to compute local list. ▪ In terms

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over

● ClassicAnalyzer: splits according to a set of rules (basically, non-letters) + lowercase + stopword removal + normalization..

Abstract: We study natural lepton mass matrices, obtained assuming the stability of physical flavour observables with respect to the variations of individual matrix elements.

The introduction of endoscopic endonasal techniques through the medial wall and with superior and inferior eyelid incisions, as used in blepharoplasty, have made this

Al momento circa 30 aziende conferiscono il latte a caseifici ed altrettante lo trasformano direttamente in azienda, mentre sono 6 le malghe sulle quali si produ- ce formaggio

images inaZude the point, i.e.. - Bv Prooosition 2, we have only to Drove that an o-regular function is c. o-regular iff ii) is true.. Then let f be a