• Non ci sono risultati.

Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

N/A
N/A
Protected

Academic year: 2021

Condividi "Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)"

Copied!
110
0
0

Testo completo

(1)

Databases

DBMS Architecture:

Hashing Techniques (RDBMS)

(2)

References

▪ Hashing Techniques:

▪ Elmasri, 7th Ed. Chapter 16, section 8.

▪ Cormen, 3rd Ed. Chapter 11.

▪ Inverted indexing:

▪ Elmasri, 7th Ed. Chapter 27, section 5.

(3)

Hashing Techniques (1)

■ High efficient indexes for equality search.

■ They could be used both for internal or external files.

■ We could classify our techniques in other two types:

■ Static Hashing: for fixed size non-mutable data utilizzabile (e.g.

single session CD-ROM/DVD/BlueRay).

■ Hashing for Dynamic File Expansion: when both data and data sizes could vary in time.

■ As for the indices for secondary memory trees, hash indices

have use blocks for storing buckets.

(4)

Hashing Techniques (2)

▪ The information could be accessed fast: another type of primary file organization.

▪ Value search

▪ Bucketing (see hash join)

▪ We want to examine an arbitrary position in an array in O(1) time.

▪ For some hash data structures, in the worse case scenario the seek time is Θ(n)

▪ Allow to implement the dictionary operations (Insert,

Search, Delete).

(5)

Hashing Functions

▪ We want to store our information into a given number of records using blocks (and buckets).

▪ The search condition involves a (single) hash field.

▪ In most cases, such field is a key field of the file, in which case it is called hash key.

▪ We define a hashing function mapping each value into a range 0..M-1.

▪ e.g.: h(K)=K mod M

▪ If K is non-numeric, its byte representation is used

(6)

Collision Resolution

▪ It implies that some collisions may happed (exists k

and k’ s.t. h(k)=h(k’)) that have to be resolved (collision resolutions). Collision resolutions are the following:

▪ Chaining: the bucket array is exteded with overflow.

▪ Open addressing: from the occupied position, we seek the next free position.

▪ Multiple hashing: if one hash function generates a collision, we try with a next function…

▪ We’re going to see only the first technique.

(7)

Static Hashing

■ A fixed number of buckets M is allocated, and an hashing function mapping K to [0,…,M-1].

■ Each bucket has at least one block.

■ Each block is composed by m records.

■ E.g., H(K) returns the first/last M bits of K’s byte representation.

■ In this case, i = 1, N = 2

i

= 2, and each bucket contains a block

(8)

Static Hashing for Internal Files

(9)

Static Hashing for Internal Files

(10)

Static Hashing: searching 1100

0001 1001 1100 0

1

1100 H

■ The hash function computes the array index where the records

with a given hash field are stored

(11)

Ricerca di 1100 con hashing statico

0001 1001 1100 0

1

1100 H

■ The hash function computes the array index where the records

with a given hash field are stored

(12)

Static Hashing: Insert(1010)

0001 1001 1100 0

1

1010 H

■ We use the chaining method to extend a block with overflow

positions.

(13)

0001 1001 1100 0

1

1010 H

■ We use the chaining method to extend a block with overflow positions.

Static Hashing: Insert(1010)

(14)

0001 1001 1100 0

1

1010 H

1010

■ We use the chaining method to extend a block with overflow positions.

Static Hashing: Insert(1010)

(15)

How many blocks are there?

A √0 B 1

4

C 2 D 3

0 0001

(16)

How many buckets are there?

A √0 B 1

4

C 2 D 3

0001 1001 1100 0

1 1010

(17)

Static Hashing: Delete(1100)

0001 1001 1100 0

1

1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(18)

Static Hashing: Delete(1100)

0001 1001 1100 0

1

1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(19)

Static Hashing: Delete(1100)

0001 1001 1100 0

1

1100 H

1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(20)

Static Hashing: Delete(1100)

0001 1001 1010 0

1100 1 1010

■ Overflows degradate static hashing's efficiency.

■ Value deletion could reduce the number of overflows.

(21)

Static Hashing: Efficiency(?)

When no overflow occur, buckets could be accessed with a single access.

■ Then, m scans have to be performend if we want to reach a specific record.

■ Overflows cause a quick performance degradation

■ Efficency depends on:

■ Index size vs. Data size ratio, number of buckets.

■ Uniform hashing (e.g. H(k) is not a constant function…).

■ H(K) could vary at run time in order to reduce the number of overflows

(22)

Dynamic File Expansion

■ Collision resolution could be avoided by changing the number of the buckets dynamically.

■ Extendible Hashing: bucekts are used through an extendible “directory”, storing pointers to

buckets. Overflows are not used. The indexed buckets increase exponentially.

■ Linear Hashing: No “directory” are used.

Overflows are still used. The indexed buckets increase linearly.

■ Dynamic Hashing: (A precursor of extendible

(23)

External Hashing: a general example

(24)

Extendible Hashing (1)

■ Uses a level of indirection through an array of pointers to buckets (directory).

■ The directory grows by doubling their size. It has size 2

d

, increasing linearly by d (global depth).

■ Each entry in the directory points to a single bucket, but each bucket can be accessed from multiple

entries.

■ Each bucket contains a d’ variable (local depth) that

indicates how many bits are actually used to index it.

(25)

Extendible Hashing (2)

■ The hash function refreshes alongside with d: it

returns the most significant d bits of the binary coded search key.

■ Extendible hashing doesn’t use overflow blocks, as

the data structure is dynamically updated.

(26)

Extendible Hashing: Search(1100)

0001 1

1001 1100

1 0

d=1

1100 H 1

directory

global depth

(27)

0001 1

1001 1100

1 0

d=1

1100 H 1

directory

global depth

Extendible Hashing: Search(1100)

(28)

Extendible Hashing: Insert

■ Retrive the bucket where to store the value, if there is enough room, store it!

Otherwise, get d’.

If d’ < d

■ The block is halved (halving).

■ Keys are distributed within the halved blocks using d’ + 1 bits.

■ d' is incremented to d'+1

■ The directory is updated with the pointer to the newly-created block

If d’ = d

■ d is incremented to d+1 (doubling).

■ Each directory entry is doubled, such that each entry w produces two entries, w0 and w1.

■ Continue as previously stated

(29)

Extendible Hashing: Insert(1010) (1)

1010

0001 1

1001 1100 1 1

0 d=1 H

■ Retrive the bucket where to store the value

(30)

1010

0001 1

1001 1100 1 1

0 d=1 H

■ Retrive the bucket where to store the value

There is no room!

Extendible Hashing: Insert(1010) (1)

(31)

1001 1100

1 0001 1 i=1

0 H

1

1

1010

■ Compare d' and d

1010 H

1

1

1010 H 0

1

1

1010

i=1 H 0

1

1

1010

0001 1 i=1

H 0

1

1

1010

0001 1 i=1

H 0

1

1

1010

0001 1 i=1

0 H

1

1

1010 1001

1100

1 0001 1 d=1

H 0 1010 1

Extendible Hashing: Insert(1010) (2)

(32)

1001 1100

1 0001 1 d=2

H 0 1010 1

Si incrementa i di 1

■ Increase d by 1

Extendible Hashing: Insert(1010) (3)

(33)

1001 1 0001 1 d=2

H 0 1010 1

■ Each directory entry is doubled, such that each entry w produces two entries, w

0

and w

1

.

01 00

Extendible Hashing: Insert(1010) (3)

(34)

Extendible Hashing: Insert(1010) (4)

0001 1

1001 1100

1 00

d=2

1010 01

10 11 10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

■ The block is halved

(35)

0001 1

1001 1 00

d=2

1010 01

01 00 01 00 01 00 01 00

▪ The block is halved

Extendible Hashing: Insert(1010) (4)

(36)

0001 1

1001 1100

1 00

d=2

1010 01

10

11 SPLIT

10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

▪The block is halved

▪Keys are distributed within the halved blocks using d’+1 bits.

1010

Extendible Hashing: Insert(1010) (4)

(37)

0001 1

1001 1 00

d=2

1010 01

01 00 01 00 01 00 01 00

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits

▪ d' is increased by one

1001 2

Extendible Hashing: Insert(1010) (4)

(38)

0001 1

1001 1100

1 00

d=2

1010 01

10

11 SPLIT

10 11 01 10 11 00 01 00 10 01 00

11 10 01 00

1010 1001 1010

2

▪ The directory is updated with the pointer to the newly-created block

Extendible Hashing: Insert(1010) (4)

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits

▪ d' is increased by one

(39)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11

▪ Retrive the bucket where to store the value

Extendible Hashing: Insert(0100)

(40)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11

▪ Retrive the bucket where to store the value

Extendible Hashing: Insert(0100)

(41)

Extendible Hashing: Insert(0100)

0001 1

1001 1010

2 00

d=2

0100 H 01

1100 2 10

11

0100

▪ Retrive the bucket where to store the value

(42)

Extendible Hashing: Insert(0101) (1)

0001 0100

1

1001 1010

2 00

d=2

0101 H 01

1100 2 10

11

▪ Retrive the bucket where to store the value

(43)

Extendible Hashing: Insert(0101) (1)

0001 0100

1

1001 1010

2 00

d=2

0101 H 01

10

▪ Retrive the bucket where to store the value

(44)

d’ < d

Extendible Hashing: Insert(0101) (2)

Si confronta j con i

0001 0100

1

1001 1010

2 00

d=2

0101 H 01

1100 2 10

11

▪ Compare d' and d

(45)

Extendible Hashing: Insert(0101) (3)

00 d=2

▪ The block is halved

0001 0100

1

(46)

Extendible Hashing: Insert(0101) (3)

1001 1010

2 00

d=2

0101 01

10

SPLIT

▪ The block is halved

0001 0100

1

1

(47)

Extendible Hashing: Insert(0101) (3)

00 d=2

SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

0001 0100

1

1 0001 1

0100 0101

1

(48)

Extendible Hashing: Insert(0101) (3)

1001 1010

2 00

d=2

0101 01

10

SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

▪ d' is increased by one

0001 0100

1

1 0001 1

0100 0101 0100 1 0101

2

0001 2

(49)

Extendible Hashing: Insert(0101) (3)

00 d=2

SPLIT

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’ + 1 bits.

▪ d' is increased by one

▪ The directory is updated with the pointer to the newly-created block

0001 0100

1

1 0001 1

0100 0101 0100 1 0101

2

0001 2

(50)

Hashing estendibile: inserimento 1000 (1)

0100 0101

2

1001 1010

2 00

d=2

1000 H 01

1100 2 10

11

0001 2

▪ Retrive the bucket where to store the value

(51)

Hashing estendibile: inserimento 1000 (1)

0100 0101

2

1001 1010

2 00

d=2

1000 H 01

10 11

0001 2

▪ Retrive the bucket where to store the value

(52)

▪ Compare d' and d

0100 0101

2

1001 1010

2 00

d=2

1000 H 01

1100 2 10

11

0001 2

Hashing estendibile: inserimento 1000 (2)

(53)

▪ Increase d by 1

0100 0101

2

1001 2 00

d=3

H 01

0001 2

Hashing estendibile: inserimento 1000 (3)

(54)

▪Increase d by 1

▪Each directory entry is doubled, such that each entry w produces two entries, w

0

e w

1

.

0100 0101

2

1001 1010

2 00

i=3

1000 H 01

10 11

0001 2

Hashing estendibile: inserimento 1000 (4)

d=3

(55)

0100 0101

2

1001 1010 000 2

d=3

1000 001

010

0001 2

000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

▪ The block is halved

Hashing estendibile: inserimento 1000 (4)

(56)

0100 0101

2

1001 1010 000 2

d=3

1000 001

010 011

0001 2

2

SPLIT 000

001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001

▪ The block is halved

Hashing estendibile: inserimento 1000 (4)

(57)

0100 0101

2

1001 1010 000 2

d=3

1000 001

010

0001 2

000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

1001 1000

2

Hashing estendibile: inserimento 1000 (4)

(58)

Hashing estendibile: inserimento 1000 (4)

0100 0101

2

1001 1010 000 2

d=3

1000 001

010 011

0001 2

2

SPLIT 000

001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001

▪ d' is increased by one

▪ The block is halved

▪ Keys are distributed within the halved blocks using d’+1 bits.

1001 1000

2 1010 2 1001

1000

3

1010 3

(59)

0100 0101

2

1001 1000 000 3

d=3

1000 001

010

0001 2

1010 3 000

001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001

Hashing estendibile: inserimento 1000 (5)

▪ The directory is updated with the pointer to the

newly-created block

(60)

0100 0101

2

1001 1000 000 3

d=3

1000 001

010 011

0001 2

1010 3 100

000 001 010 000 001 011 010 000 001

100 011 010 000 001

100 011 010 000 001

100 011 010 000 001

100 011 010 000 001

Hashing estendibile: inserimento 1000 (5)

▪ The directory is updated with the pointer to the

newly-created block

(61)

How many blocks are there?

A 3 B 5 C 6

D 8 0100

0101

2

1001 1000 000 3

d=3

001 010 011

0001 2

1010 3

100

(62)

How many buckets are there?

A 3 B 5 C 6

D 8 0100

0101

2

1001 1000 000 3

d=3

001 010 011

0001 2

1010 3 100

101

(63)

Extendible hashing: discussion

Pros:

■ The space overhead for the directory is negligible

■ Splitting causes minor reorganizations, since only the records in one bucket are redistributed to the two new buckets (e.g.

records that start with the same bit sequence).

Cons:

■ Directory must be searched before accessing the buckets themselves: we have two block accesses (directory+data file buckets). This penalty is considered minor.

■ The exponential memory increase in first memory is quite

inefficient (it allocates more space than the one actually

(64)

Linear Hashing

■ Hash file expands and shrinks without needing a directory.

■ Overflow blocks are allowed.

■ The number of buckets increases linearly.

■ Such increase happens when (i) a overflow block is

inserted or (ii) when a specific record-bucket ratio (file load factor, r/n) exeeds a guard value ℓ (e.g. ℓ=1.7).

■ H

i

(K) returns the less significant i bits.

(65)

Linear Hashing: Search(0101)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

0000 1111 0101 00

01

i=2 n=3

r=4

0101 H

2

1010

r/n = 1.33

(66)

0000 1111 0101 00

01

i=2 n=3

r=4 0101

10 1010

r/n = 1.33

Linear Hashing: Search(0101)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

H

2

(67)

0000 1111 0101 00

01

i=2 n=3

r=4 0101

1010

r/n = 1.33

Linear Hashing: Search(0101)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

H

2

(68)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

1111

10 1010

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

Linear Hashing: Search(1111)

H

2

(69)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

1111

1010

Linear Hashing: Search(1111)

H

2

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(70)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

1111

10 1010 3 – 2 (2-1)

= 1

10

= 01

2

Linear Hashing: Search(1111)

H

2

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(71)

Linear Hashing: Insert (1)

■ r records, n buckets, file load factor (r/n), guard ℓ

■ When the file load factor is exceeded, a bucket gets splitted and a new n+1 bucket is created.

■ While using the hashing function H

i

, all the buckets up to the 2

i-1

-th are splitted.

■ When n>2

i

, H

i

is changed with H

i+1

and the buckets are splitted,

once again, from the first one…

(72)

■ if H

i

(k) = m < n:

■ store k in the m-th bucket (use overflows when necessary)

■ Otherwise

■ store k in the (m - 2

i -1

)-th bucket

■ Increase r, if the r/n>ℓ then:

■ if n = 2

i

, increase i by 1

■ (n)

2

=a

1

a

2

… a

i

having a

1

=1

■ Clear the first bit in n and store it to m. a

1

a

2

… a

i

→ 0a

2

… a

i

■ Allocate the n-th block

■ Move all the records from block m having the i-th rightmost bit as 1 into the n-th block.

■ Increase n

Linear Hashing: Insert (2)

(73)

Linear Hashing: Insert(0101) (1)

0000 1010 1111 0

1

i=1 n=2

r=3

r/n = 1.5

0101 H

1

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(74)

0000 1010 1111 0

1

i=1 n=2

r=3

r/n = 1.5

0101

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

H

1

(75)

0000 1010 1111 0

1

i=1 n=2

r=3

r/n = 1.5

0101

0101

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

H

1

(76)

0000 1010 1111 0

1

i=1 n=2

r=4

r/n = 1.5

0101

0101

r/n = 2 >1.7

▪ Increase r

Linear Hashing: Insert(0101) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

H

1

(77)

r/n = 2

00

i=1 n=2

r=4 0000

1010

Linear Hashing: Insert(0101) (2)

(78)

r/n = 2

1111 0101 00

01

i=2 n=2

r=4

0101

▪ n = 2

1

, increase i by 1 (i=2)

▪ (n)

2

=10 (m)

2

=00

0000 1010

Linear Hashing: Insert(0101) (2)

(79)

r/n = 2

00

i=2 n=2

r=4

▪ Allocate the n-th block, having (n)

2

=10

0000 1010

Linear Hashing: Insert(0101) (2)

▪ n = 2

1

, increase i by 1 (i=2)

▪ (n)

2

=10 (m)

2

=00

(80)

r/n = 2

1111 0101 00

01

n=2 r=4

0101

▪ Allocate the n-th block, having (n)

2

=10

0000 1010

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

0000

Linear Hashing: Insert(0101) (2)

▪ n = 2

1

, increase i by 1 (i=2)

▪ (n)

2

=10 (m)

2

=00

i=2

(81)

r/n = 2

00

n=3 r=4

▪ Allocate the n-th block, having (n)

2

=10

0000 1010

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

▪ Increase n

0000

▪ n = 2

1

, increase i by 1 (i=2)

▪ (n)

2

=10 (m)

2

=00

Linear Hashing: Insert(0101) (2)

i=2

(82)

r/n = 2

1111 0101 00

01

r=4

r/n = 1.33 < 1.7

0101

▪ Allocate the n-th block, having (n)

2

=10

0000 1010 0000

Linear Hashing: Insert(0101) (2)

▪ n = 2

1

, increase i by 1 (i=2)

▪ (n)

2

=10 (m)

2

=00

n=3 i=2

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

▪ Increase n

(83)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

0001 H

Linear Hashing: Insert(0001)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(84)

Linear Hashing: Insert(0001)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

0001 H

2

10 1010

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(85)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

0001 H

Linear Hashing: Insert(0001)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(86)

0000 1111 0101 00

01

i=2 n=3

r=4

r/n = 1.33

0001 H

2

10 1010

0001

Linear Hashing: Insert(0001)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(87)

Linear Hashing: Insert(0001)

0000 1111 0101 00

01

i=2 n=3

r=5

r/n = 1.33

0001 H 0001

▪ Increase r

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(88)

0000 1111 0101 00

01

i=2 n=3

r=5

r/n = 1.33

0001 H

2

10 1010

0001

r/n = 1.66

< 1.7

▪ Increase r

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

Linear Hashing: Insert(0001)

(89)

r/n = 1.66

0000 1111 0101 00

01

i=2 n=3

r=5

0110 H 0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(90)

r/n = 1.66

0000 1111 0101 00

01

i=2 n=3

r=5

0110 H

2

10 1010

0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(91)

r/n = 1.66

0000 1111 0101 00

01

i=2 n=3

r=5

0110 H 0001

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(92)

r/n = 1.66

0000 1111 0101 00

01

i=2 n=3

r=6

r/n = 2 >1.7

0110 H

2

10 1010

0001

▪ Increase r

1010

Linear Hashing: Insert(0110) (1)

▪Given n buckets (2

i-1

< n ≤ 2

i

).

▪H

i

(K) = m < n K is in the m-th bucket.

▪H

i

(K) = m ≥ n K is in the (m - 2

i-1

)-th bucket.

(93)

0000 1111 00

i=2 n=3

r=6

r/n = 2

0001

Linear Hashing: Insert(0110) (2)

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

(94)

0000 1111 0101 00

01

i=2 n=3

r=6

r/n = 2

0110

0001

▪ Allocate the n-th block, having n = 11

2

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

Linear Hashing: Insert(0110) (2)

(95)

0000 1111 00

i=2 n=3

r=6

r/n = 2

0001

▪ Allocate the n-th block, having n = 11

2

0001

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(96)

0000 1111 0101 00

01

i=2 n=3

r=6

r/n = 2

0110

0001 0101

▪ Allocate the n-th block, having n = 11

2

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(97)

0000 1111 00

i=2 n=4

r=6

r/n = 2

▪ Increase n

0001

▪ Allocate the n-th block, having n = 11

2

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.

Linear Hashing: Insert(0110) (2)

(98)

0000 1111 0101 00

01

i=2 n=4

r=6

r/n = 2

0110

r/n = 1.5 < 1.7

0001 0101

Linear Hashing: Insert(0110) (2)

▪ Increase n

▪ Allocate the n-th block, having n = 11

2

▪ n ≠ 2

2

▪ (n)

2

=11 (m)

2

=01

▪ Move all the records from block m having the second rightmost

bit as 1 into the n-th block.

(99)

How many blocks are there?

A 2 B 3 C 4

D 5 0000

1111 0101 00

01

i=2 n=3

r=5

10 1010

0001

(100)

How many buckets are there?

A 2 B 3 C 4

D 5 0000

1111 0101 00

01

i=2 n=3

r=5

10 1010

0001

(101)

Information Retrieval

▪ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

▪ These days we frequently think first of web search, but there are many other cases:

▪ E-mail search

▪ Searching your laptop

▪ Corporate knowledge bases

(102)

Basic Assumptions

▪ Document: unstructured file

▪ Collection: A set of documents

▪ Assume it is a static,

non-hypertext collection for the moment

▪ Query: set (of sequence of) of

keywords to express an information need

▪ Goal: Retrieve documents with

information that is relevant to the

user’s information need and helps

(103)

Inverted indexes

■ Inverted indexes are used to efficiently retrieve unstructured documents through fulltext queries. Such indices are “inverted”

because the index all the documents where a given word appears.

■ A document collection. D={d

1

,…,d

n

}

A vocabulary V is a set of distinct terms in the document set

An inverted index IX of a document collection is a vocabulary that attaches distinct terms with a list of all documents that contains the term.

for each v in V

(104)

Generating the Inverted Index (1)

vidi un magnifico disegno. Rappresentava un serpente boa nell’atto di inghiottire un

Voi che trovate tornando a casa il cibo caldo e visi amici

Considerate se questo è un uomo Una mattina, svegliandosi da sogni inquieti, Gregor Samsa si trovò nel suo letto trasformato in un insetto mostruoso D1

D2

D3

da 4

Gregor 7 in 15

… un 16

… a 5

casa 6 che 2

… animale 14

atto 10 boa 8

word number

(105)

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7)

da 4

Gregor 7 in 15

… un 16

… a 5

casa 6 che 2

… animale 14

Generating the Inverted Index (1)

(106)

Inverted Index: queries (1)

• Return documents containing “un” AND “atto”.

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7) in (D1,15)

un (D1,16), (D3,2), (D3,6), (D3,13) D1, D3 D3

D3

(107)

Stemming

▪ Reduce terms to their “roots” before indexing and reduces the size of the inverted index.

▪ “Stemming” suggests crude postfix chopping

▪ language dependent

▪ e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both

for exampl compress

and compress ar both

(108)

Stop words

▪ With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

▪ They have little semantic content: the, a, and, to, be

▪ There are very frequent: ~30% of postings for top 30 words

▪ But the trend is away from doing this:

▪ It significantly reduces the size of an inverted index

▪ It allows “query optimizations” because the

machine-representation of a query won’t include such words.

▪ On the other hand, you need them for:

▪ Sentence Queries: “King of Denmark”

▪ Various song titles, etc.: “Let it be”, “To be or not to

(109)

Inverted Index: queries (2)

• Return documents containing “un atto”.

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7)

(D3,10)

D3:

2, 10 6, 10 13, 10

(110)

Inverted Index: queries (3)

• Return documents containing “un animale”.

a (D2,5)

animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4)

Gregor (D1,7) in (D1,15)

un (D1,16), (D3,2), (D3,6), (D3,13) (D1,16), (D3,2), (D3,6), (D3,13) (D3,14)

D3:

2, 14 6, 14 13, 14

Riferimenti

Documenti correlati

Enumeration types (or enum) is a user defined data type mainly. used to assign names to constants easy to

tweets/news/search results (visualization) Users could also be Web pages (dedup), products (recommendation),. tweets/news/search

tweets/news/search results (visualization) Users could also be Web pages (dedup), products (recommendation),.. tweets/news/search

tweets/news/search results (visualization) Users could also be Web pages (dedup), products (recommendation),. tweets/news/search

• Con un appropriato dimensionamento della tabella hash, è possibile avere nel caso medio un costo (1) per tutte le operazioni. ‣ Intuitivamente, le liste di trabocco sono

• I dati in memoria secondaria possono essere utilizzati solo se prima trasferiti in memoria principale... Memoria principale e

▪ In questo ultimo caso, al livello superiore deve essere inserita una nuova chiave per indicizzare il nuovo nodo creato.

■ Read Committed: the transaction asks exclusive locks prior to updating data i dati and keeps them until its end, asks for shared locks for reading operations that are released