
Advanced Algorithms for Massive Datasets

Academic year: 2021


(1)

Advanced Algorithms for Massive Datasets

Basics of Hashing

(2)

The Dictionary Problem

Definition. We are given a dictionary S of n keys drawn from a universe U. We wish to design a (dynamic) data structure that supports the following operations:

membership(k) → checks whether k ∈ S

insert(k) → S = S ∪ {k}

delete(k) → S = S − {k}

(3)

Collision resolution: chaining
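Chaining can be sketched as follows — a minimal Python illustration of the three dictionary operations, using Python's built-in hash() as a stand-in for a good hash function (class and method names are illustrative):

```python
class ChainedHashTable:
    """Hashing with chaining: one list ("chain") per table slot."""

    def __init__(self, m=16):
        self.m = m
        self.table = [[] for _ in range(m)]   # m independent chains

    def _slot(self, k):
        return hash(k) % self.m               # slot of key k

    def membership(self, k):
        # scan only the chain of k's slot
        return k in self.table[self._slot(k)]

    def insert(self, k):
        chain = self.table[self._slot(k)]
        if k not in chain:                    # S is a set: no duplicates
            chain.append(k)

    def delete(self, k):
        chain = self.table[self._slot(k)]
        if k in chain:
            chain.remove(k)
```

Each operation costs time proportional to the scanned chain's length, which under uniform hashing is the load factor α = n/m in expectation.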

(4)

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n · (1/m) = n/m = α (the load factor)

(5)

Search cost

Expected search cost is O(1 + α), hence constant when α = O(1), i.e. when m = Ω(n)

(6)

In summary...

Hashing with chaining:

O(1) search/update time in expectation

O(n) optimal space

Simple to implement

...but:

Space = m log2 n + n (log2 n + log2 |U|) bits

Bounds in expectation

Uniform hashing is difficult to guarantee

Open addressing: keys are stored in the table array itself, with no chains

(7)

In practice

Typically we use simple hash functions: h(k) = k mod m, with m a prime
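A minimal illustration of such a simple division-method function, assuming an illustrative prime table size m = 101:

```python
m = 101  # table size, chosen prime (illustrative value)

def h(k):
    # division-method hash: h(k) = k mod m
    return k % m
```

Choosing m prime helps spread keys whose values share common factors, but a single fixed function offers no guarantee against adversarial key sets.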

(8)

Enforce “goodness”

As in Quicksort's random selection of its pivot, select h() at random

From which set should we draw h?

(9)

An example of Universal hash

U = universe of keys, m = table size (a prime)

The key k is split into r+1 pieces k0 k1 k2 … kr, each of ≈ log2 m bits, so r ≈ log2 |U| / log2 m. Each coefficient ai of a = a0 a1 a2 … ar is selected at random in [0, m).

h_a(k) = ( a0·k0 + a1·k1 + … + ar·kr ) mod m

not necessarily: (...mod p) mod m

(10)

Simple and efficient universal hash

h_a(x) = ( a·x mod 2^r ) div 2^(r−t)

• 0 ≤ x < |U| = 2^r
• a is odd

Few key issues:

• The output consists of t bits, so m = 2^t
• Probability of collision is ≤ 1/2^(t−1) (= 2/m)
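An illustrative Python version of this multiply-shift scheme, with r = 32 and t = 8 (so m = 256); the parameter values are illustrative:

```python
import random

r, t = 32, 8                           # |U| = 2^32 keys, m = 2^8 slots
random.seed(0)
a = random.randrange(1, 1 << r, 2)     # random ODD r-bit multiplier

def h(x):
    # h_a(x) = (a*x mod 2^r) div 2^(r-t):
    # keep the top t bits of the low r bits of the product
    return ((a * x) & ((1 << r) - 1)) >> (r - t)
```

Both the modulus and the division are powers of two, so the whole evaluation is one multiplication, one mask, and one shift.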

(11)

Minimal Ordered Perfect Hashing

m = 1.25 n (e.g. n = 12 → m = 15)

The h1 and h2 are not perfect

Minimal, not minimum

h(t) = lexicographic rank of t

(12)

h(t) = [ g( h1(t) ) + g( h2(t) ) ] mod n

h is perfect and ordered; no strings need to be stored

Space is negligible for h1 and h2, and it is m log2 n bits for g

(13)

How to construct it

Each term is an edge, whose two vertices are given by h1 and h2. Initialize all g(v) = 0; then assign g() by difference with the known h() values.

Acyclic → ok

Not acyclic → regenerate the hashes
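The construction above can be sketched as follows — a simplified, assumed Python version in which h1 and h2 are random functions, the terms are given already in the desired (e.g. lexicographic) order so term i must hash to rank i, and g is assigned by peeling degree-1 vertices of the acyclic graph:

```python
import random

def build_moph(terms, m):
    """Build a minimal ordered perfect hash for `terms` (already ordered)."""
    n = len(terms)
    while True:
        # two random hash functions (illustrative stand-ins for h1, h2)
        s1, s2 = random.random(), random.random()
        h1 = lambda t, s=s1: hash((s, t)) % m
        h2 = lambda t, s=s2: hash((s, t)) % m
        edges = [(h1(t), h2(t)) for t in terms]   # term i -> edge i
        if any(u == v for u, v in edges):
            continue                    # self-loop -> regenerate hashes
        g = _peel_and_assign(edges, m, n)
        if g is not None:
            return h1, h2, g            # graph was acyclic -> done

def _peel_and_assign(edges, m, n):
    """Return the table g if the edge graph is acyclic, else None."""
    adj = [[] for _ in range(m)]        # vertex -> incident edge ids
    for i, (u, v) in enumerate(edges):
        adj[u].append(i)
        adj[v].append(i)
    deg = [len(a) for a in adj]
    alive = [True] * len(edges)
    stack = [v for v in range(m) if deg[v] == 1]
    order = []                          # (peeled vertex, its last edge)
    while stack:
        v = stack.pop()
        if deg[v] != 1:
            continue
        i = next(e for e in adj[v] if alive[e])
        alive[i] = False
        order.append((v, i))
        for w in edges[i]:
            deg[w] -= 1
            if deg[w] == 1:
                stack.append(w)
    if any(alive):
        return None                     # a cycle remains -> fail
    g = [0] * m                         # all g(v) = 0 initially
    for v, i in reversed(order):        # assign g by difference:
        u, w = edges[i]
        other = w if v == u else u      # g[v] + g[other] ≡ rank i (mod n)
        g[v] = (i - g[other]) % n
    return g
```

Processing the peeling order in reverse guarantees that when g(v) is fixed from edge i, the other endpoint's value is already final, so h(t) = (g(h1(t)) + g(h2(t))) mod n returns exactly the rank of t.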
