Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

(1)

Databases

DBMS Architecture:

Hashing Techniques (RDBMS)

(2)

References

▪ Hashing Techniques:

▪ Elmasri, 7th Ed. Chapter 16, section 8.

▪ Cormen, 3rd Ed. Chapter 11.

▪ Inverted indexing:

▪ Elmasri, 7th Ed. Chapter 27, section 5.

(3)

Hashing Techniques (1)

■ High efficient indexes for equality search.

■ They could be used both for internal or external files.

■ We could classify our techniques in other two types:

■ Static Hashing: for fixed size non-mutable data utilizzabile (e.g.

single session CD-ROM/DVD/BlueRay).

■ Hashing for Dynamic File Expansion: when both data and data sizes could vary in time.

■ As for the indices for secondary memory trees, hash indices

have use blocks for storing buckets.

(4)

Hashing Techniques (2)

▪ The information could be accessed fast: another type of primary file organization.

▪ Value search

▪ Bucketing (see hash join)

▪ We want to examine an arbitrary position in an array in O(1) time.

▪ For some hash data structures, in the worse case scenario the seek time is Θ(n)

▪ Allow to implement the dictionary operations (Insert,

Search, Delete).

(5)

Hashing Functions

▪ We want to store our information into a given number of records using blocks (and buckets).

▪ The search condition involves a (single) hash field.

▪ In most cases, such field is a key field of the file, in which case it is called hash key.

▪ We define a hashing function mapping each value into a range 0..M-1.

▪ e.g.: h(K)=K mod M

▪ If K is non-numeric, its byte representation is used

(6)

Collision Resolution

▪ It implies that some collisions may happed (exists k

and k’ s.t. h(k)=h(k’)) that have to be resolved (collision resolutions). Collision resolutions are the following:

▪ Chaining: the bucket array is exteded with overflow.

▪ Open addressing: from the occupied position, we seek the next free position.

▪ Multiple hashing: if one hash function generates a collision, we try with a next function…

▪ We’re going to see only the first technique.

(7)

Static Hashing

■ A fixed number of buckets M is allocated, and an hashing function mapping K to [0,…,M-1].

■ Each bucket has at least one block.

■ Each block is composed by m records.

■ E.g., H(K) returns the first/last M bits of K’s byte representation.

■ In this case, i = 1, N = 2

ⁱ

= 2, and each bucket contains a block

(8)

Static Hashing for Internal Files

(9)

Static Hashing for Internal Files

(10)

Static Hashing: searching 1100

0001 1001 1100 0

1 1100 H

■ The hash function computes the array index where the records

with a given hash field are stored

(11)

Ricerca di 1100 con hashing statico

0001 1001 1100 0

1 1100 H

■ The hash function computes the array index where the records

with a given hash field are stored

(12)

Static Hashing: Insert(1010)

0001 1001 1100 0

1 1010 H

■ We use the chaining method to extend a block with overflow

positions.

(13)

0001 1001 1100 0

1 1010 H

■ We use the chaining method to extend a block with overflow positions.

Static Hashing: Insert(1010)

(14)

0001 1001 1100 0

1 1010 H

1010

■ We use the chaining method to extend a block with overflow positions.

Static Hashing: Insert(1010)

(15)

How many blocks are there?

A √0 B 1

⁴

C 2 D 3

0 0001

(16)

How many buckets are there?

A √0 B 1

⁴

C 2 D 3

0001 1001 1100 0

1 1010

(17)

Static Hashing: Delete(1100)

0001 1001 1100 0

1 1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(18)

Static Hashing: Delete(1100)

0001 1001 1100 0

1 1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(19)

Static Hashing: Delete(1100)

0001 1001 1100 0

1 1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(20)

Static Hashing: Delete(1100)

0001 1001 1010 0

1100 1 1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(21)

Static Hashing: Efficiency(?)

■ When no overflow occur, buckets could be accessed with a single access.

■ Then, m scans have to be performend if we want to reach a specific record.

■ Overflows cause a quick performance degradation

■ Efficency depends on:

■ Index size vs. Data size ratio, number of buckets.

■ Uniform hashing (e.g. H(k) is not a constant function…).

■ H(K) could vary at run time in order to reduce the number of overflows

(22)

Dynamic File Expansion

■ Collision resolution could be avoided by changing the number of the buckets dynamically.

■ Extendible Hashing: bucekts are used through an extendible “directory”, storing pointers to

buckets. Overflows are not used. The indexed buckets increase exponentially.

■ Linear Hashing: No “directory” are used.

Overflows are still used. The indexed buckets increase linearly.

■ Dynamic Hashing: (A precursor of extendible

(23)

External Hashing: a general example

(24)

Extendible Hashing (1)

■ Uses a level of indirection through an array of pointers to buckets (directory).

■ The directory grows by doubling their size. It has size 2

^d

, increasing linearly by d (global depth).

■ Each entry in the directory points to a single bucket, but each bucket can be accessed from multiple

entries.

■ Each bucket contains a d’ variable (local depth) that

indicates how many bits are actually used to index it.

(25)

Extendible Hashing (2)

■ The hash function refreshes alongside with d: it

returns the most significant d bits of the binary coded search key.

■ Extendible hashing doesn’t use overflow blocks, as

the data structure is dynamically updated.

(26)

Extendible Hashing: Search(1100)

0001 1

1001 1100

1 0

d=1

1100 H 1

global depth

(27)

0001 1

1001 1100

1 0

d=1

1100 H 1

global depth

Extendible Hashing: Search(1100)

(28)

Extendible Hashing: Insert

■ Retrive the bucket where to store the value, if there is enough room, store it!

■ Otherwise, get d’.

■ If d’ < d

■ The block is halved (halving).

■ Keys are distributed within the halved blocks using d’ + 1 bits.

■ d' is incremented to d'+1

■ The directory is updated with the pointer to the newly-created block

■ If d’ = d

■ d is incremented to d+1 (doubling).

■ Each directory entry is doubled, such that each entry w produces two entries, w₀ and w₁.

■ Continue as previously stated

(29)

Extendible Hashing: Insert(1010) (1)

1010

0001 1

1001 1100 1 1

0 d=1 H

■ Retrive the bucket where to store the value

(30)

1010

0001 1

1001 1100 1 1

0 d=1 H

■ Retrive the bucket where to store the value

There is no room!

Extendible Hashing: Insert(1010) (1)

(31)

1001 1100

1 0001 1 i=1

0 H

1

1 1010

■ Compare d' and d

1010 H

1

1 1010 H 0

1

1 1010

i=1 H 0

1

1 1010

0001 1 i=1

H 0

1

1 1010

0001 1 i=1

H 0

1

1 1010

0001 1 i=1

0 H

1

1 1010 1001

1100

1 0001 1 d=1

H 0 1010 1

Extendible Hashing: Insert(1010) (2)

(32)

1001 1100

1 0001 1 d=2

H 0 1010 1

Si incrementa i di 1

■ Increase d by 1

Extendible Hashing: Insert(1010) (3)

(33)

1001 1 0001 1 d=2

H 0 1010 1

■ Each directory entry is doubled, such that each entry w produces two entries, w

₀

and w

₁

.

01 00

Extendible Hashing: Insert(1010) (3)

(34)

Extendible Hashing: Insert(1010) (4)

0001 1

1001 1100

1 00

d=2

1010 01

10 11 10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

■ The block is halved

(35)

0001 1

1001 1 00

d=2

1010 01

01 00 01 00 01 00 01 00

▪ The block is halved

Extendible Hashing: Insert(1010) (4)

(36)

0001 1

1001 1100

1 00

d=2

1010 01

10 11 SPLIT

10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

▪The block is halved

▪Keys are distributed within the halved blocks using d’+1 bits.

1010

Extendible Hashing: Insert(1010) (4)

(37)

0001 1

1001 1 00

d=2

1010 01

01 00 01 00 01 00 01 00

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits

▪ d' is increased by one

1001 2

Extendible Hashing: Insert(1010) (4)

(38)

0001 1

1001 1100

1 00

d=2

1010 01

10 11 SPLIT

10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

1010 1001 1010

2 ▪ The directory is updated with the pointer to the newly-created block

Extendible Hashing: Insert(1010) (4)

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits

▪ d' is increased by one

(39)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11 ▪ Retrive the bucket where to store the value

Extendible Hashing: Insert(0100)

(40)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11 ▪ Retrive the bucket where to store the value

Extendible Hashing: Insert(0100)

(41)

Extendible Hashing: Insert(0100)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11 0100

▪ Retrive the bucket where to store the value

(42)

Extendible Hashing: Insert(0101) (1)

0001 0100

1 1001 1010

2 00

d=2

0101 H 01

1100 2 10

11 ▪ Retrive the bucket where to store the value

(43)

Extendible Hashing: Insert(0101) (1)

0001 0100

1 1001 1010

2 00

d=2

0101 H 01

10 ▪ Retrive the bucket where to store the value

(44)

d’ < d

Extendible Hashing: Insert(0101) (2)

Si confronta j con i

0001 0100

1 1001 1010

2 00

d=2

0101 H 01

1100 2 10

11 ▪ Compare d' and d

(45)

Extendible Hashing: Insert(0101) (3)

00 d=2

▪ The block is halved

0001 0100

1

(46)

Extendible Hashing: Insert(0101) (3)

1001 1010

2 00

d=2

0101 01

10 SPLIT

▪ The block is halved

0001 0100

1

(47)

Extendible Hashing: Insert(0101) (3)

00 d=2

SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

0001 0100

1 1 0001 1

0100 0101

1

(48)

Extendible Hashing: Insert(0101) (3)

1001 1010

2 00

d=2

0101 01

10 SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

▪ d' is increased by one

0001 0100

1 1 0001 1

0100 0101 0100 1 0101

2 0001 2

(49)

Extendible Hashing: Insert(0101) (3)

00 d=2

SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’ + 1 bits.

▪ d' is increased by one

▪ The directory is updated with the pointer to the newly-created block

0001 0100

1 1 0001 1

0100 0101 0100 1 0101

2 0001 2

(50)

Hashing estendibile: inserimento 1000 (1)

0100 0101

2 1001 1010

2 00

d=2

1000 H 01

1100 2 10

11 0001 2

▪ Retrive the bucket where to store the value

(51)

Hashing estendibile: inserimento 1000 (1)

0100 0101

2 1001 1010

2 00

d=2

1000 H 01

10 11

0001 2

▪ Retrive the bucket where to store the value

(52)

▪ Compare d' and d

0100 0101

2 1001 1010

2 00

d=2

1000 H 01

1100 2 10

11 0001 2

Hashing estendibile: inserimento 1000 (2)

(53)

▪ Increase d by 1

0100 0101

2 1001 2 00

d=3

H 01

0001 2

Hashing estendibile: inserimento 1000 (3)

(54)

▪Increase d by 1

▪Each directory entry is doubled, such that each entry w produces two entries, w

₀

e w

₁

.

0100 0101

2 1001 1010

2 00

i=3

1000 H 01

10 11

0001 2

Hashing estendibile: inserimento 1000 (4)

d=3

(55)

0100 0101

2 1001 1010 000 2

d=3

1000 001

010 0001 2

000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

▪ The block is halved

Hashing estendibile: inserimento 1000 (4)

(56)

0100 0101

2 1001 1010 000 2

d=3

1000 001

010 011

0001 2

2 SPLIT 000

001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001

▪ The block is halved

Hashing estendibile: inserimento 1000 (4)

(57)

0100 0101

2 1001 1010 000 2

d=3

1000 001

010 0001 2

000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

1001 1000

2 Hashing estendibile: inserimento 1000 (4)

(58)

Hashing estendibile: inserimento 1000 (4)

0100 0101

2 1001 1010 000 2

d=3

1000 001

010 011

0001 2

2 SPLIT 000

001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001

▪ d' is increased by one

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

1001 1000

2 1010 2 1001

1000

3 1010 3

(59)

0100 0101

2 1001 1000 000 3

d=3

1000 001

010 0001 2

1010 3 000

001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

Hashing estendibile: inserimento 1000 (5)

▪ The directory is updated with the pointer to the

newly-created block

(60)

0100 0101

2 1001 1000 000 3

d=3

1000 001

010 011

0001 2

1010 3 100

000 001 010 000 001 011 010 000 001

100 011 010 000 001

Hashing estendibile: inserimento 1000 (5)

▪ The directory is updated with the pointer to the

newly-created block

(61)

How many blocks are there?

A 3 B 5 C 6

D 8 0100

0101

2 1001 1000 000 3

d=3

001 010 011

0001 2

1010 3

100

(62)

How many buckets are there?

A 3 B 5 C 6

D 8 0100

0101

2 1001 1000 000 3

d=3

001 010 011

0001 2

1010 3 100

101

(63)

Extendible hashing: discussion

Pros:

■ The space overhead for the directory is negligible

■ Splitting causes minor reorganizations, since only the records in one bucket are redistributed to the two new buckets (e.g.

records that start with the same bit sequence).

Cons:

■ Directory must be searched before accessing the buckets themselves: we have two block accesses (directory+data file buckets). This penalty is considered minor.

■ The exponential memory increase in first memory is quite

inefficient (it allocates more space than the one actually

(64)

Linear Hashing

■ Hash file expands and shrinks without needing a directory.

■ Overflow blocks are allowed.

■ The number of buckets increases linearly.

■ Such increase happens when (i) a overflow block is

inserted or (ii) when a specific record-bucket ratio (file load factor, r/n) exeeds a guard value ℓ (e.g. ℓ=1.7).

■ H

_i

(K) returns the less significant i bits.

(65)

Linear Hashing: Search(0101)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

0000 1111 0101 00

01 i=2 n=3

r=4

0101 H

₂

1010

r/n = 1.33

(66)

0000 1111 0101 00

01 i=2 n=3

r=4 0101

10 1010

r/n = 1.33

Linear Hashing: Search(0101)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

H

₂

(67)

0000 1111 0101 00

01 i=2 n=3

r=4 0101

1010

r/n = 1.33

Linear Hashing: Search(0101)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

H

₂

(68)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

1111

10 1010

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

Linear Hashing: Search(1111)

H

₂

(69)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

1111

1010

Linear Hashing: Search(1111)

H

₂

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(70)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

1111

10 1010 3 – 2 (2-1)

= 1

₁₀

= 01

₂

Linear Hashing: Search(1111)

H

₂

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(71)

Linear Hashing: Insert (1)

■ r records, n buckets, file load factor (r/n), guard ℓ

■ When the file load factor is exceeded, a bucket gets splitted and a new n+1 bucket is created.

■ While using the hashing function H

_i

, all the buckets up to the 2

^i-1

-th are splitted.

■ When n>2

ⁱ

, H

_i

is changed with H

_i+1

and the buckets are splitted,

once again, from the first one…

(72)

■ if H

_i

(k) = m < n:

■ store k in the m-th bucket (use overflows when necessary)

■ Otherwise

■ store k in the (m - 2

^{i -1}

)-th bucket

■ Increase r, if the r/n>ℓ then:

■ if n = 2

ⁱ

, increase i by 1

■ (n)

₂

=a

₁

a

₂

… a

_i

having a

₁

=1

■ Clear the first bit in n and store it to m. a

₁

a

₂

… a

_i

→ 0a

₂

… a

_i

■ Allocate the n-th block

■ Move all the records from block m having the i-th rightmost bit as 1 into the n-th block.

■ Increase n

Linear Hashing: Insert (2)

(73)

Linear Hashing: Insert(0101) (1)

0000 1010 1111 0

1 i=1 n=2

r=3

r/n = 1.5

0101 H

₁

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(74)

0000 1010 1111 0

1 i=1 n=2

r=3

r/n = 1.5

0101

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

H

₁

(75)

0000 1010 1111 0

1 i=1 n=2

r=3

r/n = 1.5

0101

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

H

₁

(76)

0000 1010 1111 0

1 i=1 n=2

r=4

r/n = 1.5

0101

r/n = 2 >1.7

▪ Increase r

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

H

₁

(77)

r/n = 2

00 i=1 n=2

r=4 0000

1010

Linear Hashing: Insert(0101) (2)

(78)

r/n = 2

1111 0101 00

01 i=2 n=2

r=4

0101

▪ n = 2

¹

, increase i by 1 (i=2)

▪ (n)

₂

=10 (m)

₂

=00

0000 1010

Linear Hashing: Insert(0101) (2)

(79)

r/n = 2

00 i=2 n=2

r=4

▪ Allocate the n-th block, having (n)

₂

=10

0000 1010

Linear Hashing: Insert(0101) (2)

▪ n = 2

¹

, increase i by 1 (i=2)

▪ (n)

₂

=10 (m)

₂

=00

(80)

r/n = 2

1111 0101 00

01 n=2 r=4

0101

▪ Allocate the n-th block, having (n)

₂

=10

▪ 0000 1010

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

0000

Linear Hashing: Insert(0101) (2)

▪ n = 2

¹

, increase i by 1 (i=2)

▪ (n)

₂

=10 (m)

₂

=00

i=2

(81)

r/n = 2

00 n=3 r=4

▪ Allocate the n-th block, having (n)

₂

=10

0000 1010

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

▪ Increase n

0000

▪ n = 2

¹

, increase i by 1 (i=2)

▪ (n)

₂

=10 (m)

₂

=00

Linear Hashing: Insert(0101) (2)

i=2

(82)

r/n = 2

1111 0101 00

01 r=4

r/n = 1.33 < 1.7

0101

▪ Allocate the n-th block, having (n)

₂

=10

0000 1010 0000

Linear Hashing: Insert(0101) (2)

▪ n = 2

¹

, increase i by 1 (i=2)

▪ (n)

₂

=10 (m)

₂

=00

n=3 i=2

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

▪ Increase n

(83)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

0001 H

Linear Hashing: Insert(0001)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(84)

Linear Hashing: Insert(0001)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

0001 H

2

10 1010

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(85)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

0001 H

Linear Hashing: Insert(0001)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(86)

0000 1111 0101 00

01 i=2 n=3

r=4

r/n = 1.33

0001 H

2

10 1010

0001

Linear Hashing: Insert(0001)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(87)

Linear Hashing: Insert(0001)

0000 1111 0101 00

01 i=2 n=3

r=5

r/n = 1.33

0001 H 0001

▪ Increase r

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(88)

0000 1111 0101 00

01 i=2 n=3

r=5

r/n = 1.33

0001 H

2

10 1010

0001

r/n = 1.66

< 1.7

▪ Increase r

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

Linear Hashing: Insert(0001)

(89)

r/n = 1.66

0000 1111 0101 00

01 i=2 n=3

r=5

0110 H 0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(90)

r/n = 1.66

0000 1111 0101 00

01 i=2 n=3

r=5

0110 H

2

10 1010

0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(91)

r/n = 1.66

0000 1111 0101 00

01 i=2 n=3

r=5

0110 H 0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(92)

r/n = 1.66

0000 1111 0101 00

01 i=2 n=3

r=6

r/n = 2 >1.7

0110 H

2

10 1010

0001

▪ Increase r

1010

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

^i-1

< n ≤ 2

ⁱ

).

▪H

_i

(K) = m < n K is in the m-th bucket.

▪H

_i

(K) = m ≥ n K is in the (m - 2

^i-1

)-th bucket.

(93)

0000 1111 00

i=2 n=3

r=6

r/n = 2

0001

Linear Hashing: Insert(0110) (2)

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

(94)

0000 1111 0101 00

01 i=2 n=3

r=6

r/n = 2

0110

0001

▪ Allocate the n-th block, having n = 11

₂

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

Linear Hashing: Insert(0110) (2)

(95)

0000 1111 00

i=2 n=3

r=6

r/n = 2

0001

▪ Allocate the n-th block, having n = 11

₂

0001

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(96)

0000 1111 0101 00

01 i=2 n=3

r=6

r/n = 2

0110

0001 0101

▪ Allocate the n-th block, having n = 11

₂

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(97)

0000 1111 00

i=2 n=4

r=6

r/n = 2

▪ Increase n

0001

▪ Allocate the n-th block, having n = 11

₂

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(98)

0000 1111 0101 00

01 i=2 n=4

r=6

r/n = 2

0110

r/n = 1.5 < 1.7

0001 0101

Linear Hashing: Insert(0110) (2)

▪ Increase n

▪ Allocate the n-th block, having n = 11

₂

▪ n ≠ 2

²

▪ (n)

₂

=11 (m)

₂

=01

▪ Move all the records from block m having the second rightmost

bit as 1 into the n-th block.

(99)

How many blocks are there?

A 2 B 3 C 4

D 5 ₀₀₀₀

1111 0101 00

01 i=2 n=3

r=5

10 1010

0001

(100)

How many buckets are there?

A 2 B 3 C 4

D 5 ₀₀₀₀

1111 0101 00

01 i=2 n=3

r=5

10 1010

0001

(101)

Information Retrieval

▪ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

▪ These days we frequently think first of web search, but there are many other cases:

▪ E-mail search

▪ Searching your laptop

▪ Corporate knowledge bases

(102)

Basic Assumptions

▪ Document: unstructured file

▪ Collection: A set of documents

▪ Assume it is a static,

non-hypertext collection for the moment

▪ Query: set (of sequence of) of

keywords to express an information need

▪ Goal: Retrieve documents with

information that is relevant to the

user’s information need and helps

(103)

Inverted indexes

■ Inverted indexes are used to efficiently retrieve unstructured documents through fulltext queries. Such indices are “inverted”

because the index all the documents where a given word appears.

■ A document collection. D={d

₁

,…,d

_n

}

■ A vocabulary V is a set of distinct terms in the document set

■ An inverted index IX of a document collection is a vocabulary that attaches distinct terms with a list of all documents that contains the term.

for each v in V

(104)

Generating the Inverted Index (1)

vidi un magnifico disegno. Rappresentava un serpente boa nell’atto di inghiottire un

Voi che trovate tornando a casa il cibo caldo e visi amici

Considerate se questo è un uomo Una mattina, svegliandosi da sogni inquieti, Gregor Samsa si trovò nel suo letto trasformato in un insetto mostruoso D1

D2

D3

da 4

Gregor 7 in 15

… un 16

… a 5

casa 6 che 2

… animale 14

atto 10 boa 8

word number

(105)

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7)

da 4

Gregor 7 in 15

… un 16

… a 5

casa 6 che 2

… animale 14

Generating the Inverted Index (1)

(106)

Inverted Index: queries (1)

• Return documents containing “un” AND “atto”.

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7) in (D1,15)

un (D1,16), (D3,2), (D3,6), (D3,13) D1, D3 D3

∩ D3

(107)

Stemming

▪ Reduce terms to their “roots” before indexing and reduces the size of the inverted index.

▪ “Stemming” suggests crude postfix chopping

▪ language dependent

▪ e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both

for exampl compress

and compress ar both

(108)