Databases
DBMS Architecture:
Hashing Techniques (RDBMS)
References
▪ Hashing Techniques:
▪ Elmasri, 7th Ed. Chapter 16, section 8.
▪ Cormen, 3rd Ed. Chapter 11.
▪ Inverted indexing:
▪ Elmasri, 7th Ed. Chapter 27, section 5.
Hashing Techniques (1)
■ High efficient indexes for equality search.
■ They could be used both for internal or external files.
■ We could classify our techniques in other two types:
■ Static Hashing: for fixed size non-mutable data utilizzabile (e.g.
single session CD-ROM/DVD/BlueRay).
■ Hashing for Dynamic File Expansion: when both data and data sizes could vary in time.
■ As for the indices for secondary memory trees, hash indices
have use blocks for storing buckets.
Hashing Techniques (2)
▪ The information could be accessed fast: another type of primary file organization.
▪ Value search
▪ Bucketing (see hash join)
▪ We want to examine an arbitrary position in an array in O(1) time.
▪ For some hash data structures, in the worse case scenario the seek time is Θ(n)
▪ Allow to implement the dictionary operations (Insert,
Search, Delete).
Hashing Functions
▪ We want to store our information into a given number of records using blocks (and buckets).
▪ The search condition involves a (single) hash field.
▪ In most cases, such field is a key field of the file, in which case it is called hash key.
▪ We define a hashing function mapping each value into a range 0..M-1.
▪ e.g.: h(K)=K mod M
▪ If K is non-numeric, its byte representation is used
Collision Resolution
▪ It implies that some collisions may happed (exists k
and k’ s.t. h(k)=h(k’)) that have to be resolved (collision resolutions). Collision resolutions are the following:
▪ Chaining: the bucket array is exteded with overflow.
▪ Open addressing: from the occupied position, we seek the next free position.
▪ Multiple hashing: if one hash function generates a collision, we try with a next function…
▪ We’re going to see only the first technique.
Static Hashing
■ A fixed number of buckets M is allocated, and an hashing function mapping K to [0,…,M-1].
■ Each bucket has at least one block.
■ Each block is composed by m records.
■ E.g., H(K) returns the first/last M bits of K’s byte representation.
■ In this case, i = 1, N = 2
i= 2, and each bucket contains a block
Static Hashing for Internal Files
Static Hashing for Internal Files
Static Hashing: searching 1100
0001 1001 1100 0
1
1100 H
■ The hash function computes the array index where the records
with a given hash field are stored
Ricerca di 1100 con hashing statico
0001 1001 1100 0
1
1100 H
■ The hash function computes the array index where the records
with a given hash field are stored
Static Hashing: Insert(1010)
0001 1001 1100 0
1
1010 H
■ We use the chaining method to extend a block with overflow
positions.
0001 1001 1100 0
1
1010 H
■ We use the chaining method to extend a block with overflow positions.
Static Hashing: Insert(1010)
0001 1001 1100 0
1
1010 H
1010
■ We use the chaining method to extend a block with overflow positions.
Static Hashing: Insert(1010)
How many blocks are there?
A √0 B 1
4C 2 D 3
0 0001
How many buckets are there?
A √0 B 1
4C 2 D 3
0001 1001 1100 0
1 1010
Static Hashing: Delete(1100)
0001 1001 1100 0
1
1100 H
1010
■ Overflows degradate static hashing's efficiency.
■ Value deletion could reduce the number of overflows.
Static Hashing: Delete(1100)
0001 1001 1100 0
1
1100 H
1010
■ Overflows degradate static hashing's efficiency.
■ Value deletion could reduce the number of overflows.
Static Hashing: Delete(1100)
0001 1001 1100 0
1
1100 H
1010
■ Overflows degradate static hashing's efficiency.
■ Value deletion could reduce the number of overflows.
Static Hashing: Delete(1100)
0001 1001 1010 0
1100 1 1010
■ Overflows degradate static hashing's efficiency.
■ Value deletion could reduce the number of overflows.
Static Hashing: Efficiency(?)
■ When no overflow occur, buckets could be accessed with a single access.
■ Then, m scans have to be performend if we want to reach a specific record.
■ Overflows cause a quick performance degradation
■ Efficency depends on:
■ Index size vs. Data size ratio, number of buckets.
■ Uniform hashing (e.g. H(k) is not a constant function…).
■ H(K) could vary at run time in order to reduce the number of overflows
Dynamic File Expansion
■ Collision resolution could be avoided by changing the number of the buckets dynamically.
■ Extendible Hashing: bucekts are used through an extendible “directory”, storing pointers to
buckets. Overflows are not used. The indexed buckets increase exponentially.
■ Linear Hashing: No “directory” are used.
Overflows are still used. The indexed buckets increase linearly.
■ Dynamic Hashing: (A precursor of extendible
External Hashing: a general example
Extendible Hashing (1)
■ Uses a level of indirection through an array of pointers to buckets (directory).
■ The directory grows by doubling their size. It has size 2
d, increasing linearly by d (global depth).
■ Each entry in the directory points to a single bucket, but each bucket can be accessed from multiple
entries.
■ Each bucket contains a d’ variable (local depth) that
indicates how many bits are actually used to index it.
Extendible Hashing (2)
■ The hash function refreshes alongside with d: it
returns the most significant d bits of the binary coded search key.
■ Extendible hashing doesn’t use overflow blocks, as
the data structure is dynamically updated.
Extendible Hashing: Search(1100)
0001 1
1001 1100
1 0
d=1
1100 H 1
directory
global depth
0001 1
1001 1100
1 0
d=1
1100 H 1
directory
global depth
Extendible Hashing: Search(1100)
Extendible Hashing: Insert
■ Retrive the bucket where to store the value, if there is enough room, store it!
■ Otherwise, get d’.
■ If d’ < d
■ The block is halved (halving).
■ Keys are distributed within the halved blocks using d’ + 1 bits.
■ d' is incremented to d'+1
■ The directory is updated with the pointer to the newly-created block
■ If d’ = d
■ d is incremented to d+1 (doubling).
■ Each directory entry is doubled, such that each entry w produces two entries, w0 and w1.
■ Continue as previously stated
Extendible Hashing: Insert(1010) (1)
1010
0001 1
1001 1100 1 1
0 d=1 H
■ Retrive the bucket where to store the value
1010
0001 1
1001 1100 1 1
0 d=1 H
■ Retrive the bucket where to store the value
There is no room!
Extendible Hashing: Insert(1010) (1)
1001 1100
1 0001 1 i=1
0 H
1
1
1010
■ Compare d' and d
1010 H
1
1
1010 H 0
1
1
1010
i=1 H 0
1
1
1010
0001 1 i=1
H 0
1
1
1010
0001 1 i=1
H 0
1
1
1010
0001 1 i=1
0 H
1
1
1010 1001
1100
1 0001 1 d=1
H 0 1010 1
Extendible Hashing: Insert(1010) (2)
1001 1100
1 0001 1 d=2
H 0 1010 1
Si incrementa i di 1
■ Increase d by 1
Extendible Hashing: Insert(1010) (3)
1001 1 0001 1 d=2
H 0 1010 1
■ Each directory entry is doubled, such that each entry w produces two entries, w
0and w
1.
01 00
Extendible Hashing: Insert(1010) (3)
Extendible Hashing: Insert(1010) (4)
0001 1
1001 1100
1 00
d=2
1010 01
10 11 10 11 01 10 11 00 01 00 10 01 00
11 10 01 00
■ The block is halved
0001 1
1001 1 00
d=2
1010 01
01 00 01 00 01 00 01 00
▪ The block is halved
Extendible Hashing: Insert(1010) (4)
0001 1
1001 1100
1 00
d=2
1010 01
10
11 SPLIT
10 11 01 10 11 00 01 00 10 01 00
11 10 01 00
▪The block is halved
▪Keys are distributed within the halved blocks using d’+1 bits.
1010
Extendible Hashing: Insert(1010) (4)
0001 1
1001 1 00
d=2
1010 01
01 00 01 00 01 00 01 00
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits
▪ d' is increased by one
1001 2
Extendible Hashing: Insert(1010) (4)
0001 1
1001 1100
1 00
d=2
1010 01
10
11 SPLIT
10 11 01 10 11 00 01 00 10 01 00
11 10 01 00
1010 1001 1010
2
▪ The directory is updated with the pointer to the newly-created block
Extendible Hashing: Insert(1010) (4)
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits
▪ d' is increased by one
0001 1
1001 1010
2 00
d=2
0100 H 01
1100 2 10
11
▪ Retrive the bucket where to store the value
Extendible Hashing: Insert(0100)
0001 1
1001 1010
2 00
d=2
0100 H 01
1100 2 10
11
▪ Retrive the bucket where to store the value
Extendible Hashing: Insert(0100)
Extendible Hashing: Insert(0100)
0001 1
1001 1010
2 00
d=2
0100 H 01
1100 2 10
11
0100
▪ Retrive the bucket where to store the value
Extendible Hashing: Insert(0101) (1)
0001 0100
1
1001 1010
2 00
d=2
0101 H 01
1100 2 10
11
▪ Retrive the bucket where to store the value
Extendible Hashing: Insert(0101) (1)
0001 0100
1
1001 1010
2 00
d=2
0101 H 01
10
▪ Retrive the bucket where to store the value
d’ < d
Extendible Hashing: Insert(0101) (2)
Si confronta j con i
0001 0100
1
1001 1010
2 00
d=2
0101 H 01
1100 2 10
11
▪ Compare d' and d
Extendible Hashing: Insert(0101) (3)
00 d=2
▪ The block is halved
0001 0100
1
Extendible Hashing: Insert(0101) (3)
1001 1010
2 00
d=2
0101 01
10
SPLIT
▪ The block is halved
0001 0100
1
1
Extendible Hashing: Insert(0101) (3)
00 d=2
SPLIT
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits.
0001 0100
1
1 0001 1
0100 0101
1
Extendible Hashing: Insert(0101) (3)
1001 1010
2 00
d=2
0101 01
10
SPLIT
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits.
▪ d' is increased by one
0001 0100
1
1 0001 1
0100 0101 0100 1 0101
2
0001 2
Extendible Hashing: Insert(0101) (3)
00 d=2
SPLIT
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’ + 1 bits.
▪ d' is increased by one
▪ The directory is updated with the pointer to the newly-created block
0001 0100
1
1 0001 1
0100 0101 0100 1 0101
2
0001 2
Hashing estendibile: inserimento 1000 (1)
0100 0101
2
1001 1010
2 00
d=2
1000 H 01
1100 2 10
11
0001 2
▪ Retrive the bucket where to store the value
Hashing estendibile: inserimento 1000 (1)
0100 0101
2
1001 1010
2 00
d=2
1000 H 01
10 11
0001 2
▪ Retrive the bucket where to store the value
▪ Compare d' and d
0100 0101
2
1001 1010
2 00
d=2
1000 H 01
1100 2 10
11
0001 2
Hashing estendibile: inserimento 1000 (2)
▪ Increase d by 1
0100 0101
2
1001 2 00
d=3
H 01
0001 2
Hashing estendibile: inserimento 1000 (3)
▪Increase d by 1
▪Each directory entry is doubled, such that each entry w produces two entries, w
0e w
1.
0100 0101
2
1001 1010
2 00
i=3
1000 H 01
10 11
0001 2
Hashing estendibile: inserimento 1000 (4)
d=3
0100 0101
2
1001 1010 000 2
d=3
1000 001
010
0001 2
000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001
▪ The block is halved
Hashing estendibile: inserimento 1000 (4)
0100 0101
2
1001 1010 000 2
d=3
1000 001
010 011
0001 2
2
SPLIT 000
001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001
▪ The block is halved
Hashing estendibile: inserimento 1000 (4)
0100 0101
2
1001 1010 000 2
d=3
1000 001
010
0001 2
000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits.
1001 1000
2
Hashing estendibile: inserimento 1000 (4)
Hashing estendibile: inserimento 1000 (4)
0100 0101
2
1001 1010 000 2
d=3
1000 001
010 011
0001 2
2
SPLIT 000
001 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001 011 010 000 001
▪ d' is increased by one
▪ The block is halved
▪ Keys are distributed within the halved blocks using d’+1 bits.
1001 1000
2 1010 2 1001
1000
3
1010 3
0100 0101
2
1001 1000 000 3
d=3
1000 001
010
0001 2
1010 3 000
001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001 010 000 001
Hashing estendibile: inserimento 1000 (5)
▪ The directory is updated with the pointer to the
newly-created block
0100 0101
2
1001 1000 000 3
d=3
1000 001
010 011
0001 2
1010 3 100
000 001 010 000 001 011 010 000 001
100 011 010 000 001
100 011 010 000 001
100 011 010 000 001
100 011 010 000 001
Hashing estendibile: inserimento 1000 (5)
▪ The directory is updated with the pointer to the
newly-created block
How many blocks are there?
A 3 B 5 C 6
D 8 0100
0101
2
1001 1000 000 3
d=3
001 010 011
0001 2
1010 3
100
How many buckets are there?
A 3 B 5 C 6
D 8 0100
0101
2
1001 1000 000 3
d=3
001 010 011
0001 2
1010 3 100
101
Extendible hashing: discussion
Pros:
■ The space overhead for the directory is negligible
■ Splitting causes minor reorganizations, since only the records in one bucket are redistributed to the two new buckets (e.g.
records that start with the same bit sequence).
Cons:
■ Directory must be searched before accessing the buckets themselves: we have two block accesses (directory+data file buckets). This penalty is considered minor.
■ The exponential memory increase in first memory is quite
inefficient (it allocates more space than the one actually
Linear Hashing
■ Hash file expands and shrinks without needing a directory.
■ Overflow blocks are allowed.
■ The number of buckets increases linearly.
■ Such increase happens when (i) a overflow block is
inserted or (ii) when a specific record-bucket ratio (file load factor, r/n) exeeds a guard value ℓ (e.g. ℓ=1.7).
■ H
i(K) returns the less significant i bits.
Linear Hashing: Search(0101)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1111 0101 00
01
i=2 n=3
r=4
0101 H
21010
r/n = 1.33
0000 1111 0101 00
01
i=2 n=3
r=4 0101
10 1010
r/n = 1.33
Linear Hashing: Search(0101)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
H
20000 1111 0101 00
01
i=2 n=3
r=4 0101
1010
r/n = 1.33
Linear Hashing: Search(0101)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
H
20000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
1111
10 1010
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
Linear Hashing: Search(1111)
H
20000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
1111
1010
Linear Hashing: Search(1111)
H
2▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
1111
10 1010 3 – 2 (2-1)
= 1
10= 01
2Linear Hashing: Search(1111)
H
2▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
Linear Hashing: Insert (1)
■ r records, n buckets, file load factor (r/n), guard ℓ
■ When the file load factor is exceeded, a bucket gets splitted and a new n+1 bucket is created.
■ While using the hashing function H
i, all the buckets up to the 2
i-1-th are splitted.
■ When n>2
i, H
iis changed with H
i+1and the buckets are splitted,
once again, from the first one…
■ if H
i(k) = m < n:
■ store k in the m-th bucket (use overflows when necessary)
■ Otherwise
■ store k in the (m - 2
i -1)-th bucket
■ Increase r, if the r/n>ℓ then:
■ if n = 2
i, increase i by 1
■ (n)
2=a
1a
2… a
ihaving a
1=1
■ Clear the first bit in n and store it to m. a
1a
2… a
i→ 0a
2… a
i■ Allocate the n-th block
■ Move all the records from block m having the i-th rightmost bit as 1 into the n-th block.
■ Increase n
Linear Hashing: Insert (2)
Linear Hashing: Insert(0101) (1)
0000 1010 1111 0
1
i=1 n=2
r=3
r/n = 1.5
0101 H
1▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1010 1111 0
1
i=1 n=2
r=3
r/n = 1.5
0101
Linear Hashing: Insert(0101) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
H
10000 1010 1111 0
1
i=1 n=2
r=3
r/n = 1.5
0101
0101
Linear Hashing: Insert(0101) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
H
10000 1010 1111 0
1
i=1 n=2
r=4
r/n = 1.5
0101
0101
r/n = 2 >1.7
▪ Increase r
Linear Hashing: Insert(0101) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
H
1r/n = 2
00
i=1 n=2
r=4 0000
1010
Linear Hashing: Insert(0101) (2)
r/n = 2
1111 0101 00
01
i=2 n=2
r=4
0101
▪ n = 2
1, increase i by 1 (i=2)
▪ (n)
2=10 (m)
2=00
0000 1010
Linear Hashing: Insert(0101) (2)
r/n = 2
00
i=2 n=2
r=4
▪ Allocate the n-th block, having (n)
2=10
0000 1010
Linear Hashing: Insert(0101) (2)
▪ n = 2
1, increase i by 1 (i=2)
▪ (n)
2=10 (m)
2=00
r/n = 2
1111 0101 00
01
n=2 r=4
0101
▪ Allocate the n-th block, having (n)
2=10
▪
0000 1010
▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.
0000
Linear Hashing: Insert(0101) (2)
▪ n = 2
1, increase i by 1 (i=2)
▪ (n)
2=10 (m)
2=00
i=2
r/n = 2
00
n=3 r=4
▪ Allocate the n-th block, having (n)
2=10
0000 1010
▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.
▪ Increase n
0000
▪ n = 2
1, increase i by 1 (i=2)
▪ (n)
2=10 (m)
2=00
Linear Hashing: Insert(0101) (2)
i=2
r/n = 2
1111 0101 00
01
r=4
r/n = 1.33 < 1.7
0101
▪ Allocate the n-th block, having (n)
2=10
0000 1010 0000
Linear Hashing: Insert(0101) (2)
▪ n = 2
1, increase i by 1 (i=2)
▪ (n)
2=10 (m)
2=00
n=3 i=2
▪ Move all the records from block m having the second rightmost bit as 1 into the n-th block.
▪ Increase n
0000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
0001 H
Linear Hashing: Insert(0001)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
Linear Hashing: Insert(0001)
0000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
0001 H
2
10 1010
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
0001 H
Linear Hashing: Insert(0001)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1111 0101 00
01
i=2 n=3
r=4
r/n = 1.33
0001 H
2
10 1010
0001
Linear Hashing: Insert(0001)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
Linear Hashing: Insert(0001)
0000 1111 0101 00
01
i=2 n=3
r=5
r/n = 1.33
0001 H 0001
▪ Increase r
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
0000 1111 0101 00
01
i=2 n=3
r=5
r/n = 1.33
0001 H
2
10 1010
0001
r/n = 1.66
< 1.7
▪ Increase r
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
Linear Hashing: Insert(0001)
r/n = 1.66
0000 1111 0101 00
01
i=2 n=3
r=5
0110 H 0001
Linear Hashing: Insert(0110) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
r/n = 1.66
0000 1111 0101 00
01
i=2 n=3
r=5
0110 H
2
10 1010
0001
Linear Hashing: Insert(0110) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
r/n = 1.66
0000 1111 0101 00
01
i=2 n=3
r=5
0110 H 0001
Linear Hashing: Insert(0110) (1)
▪Given n buckets (2
i-1< n ≤ 2
i).
▪H
i(K) = m < n K is in the m-th bucket.
▪H
i(K) = m ≥ n K is in the (m - 2
i-1)-th bucket.
r/n = 1.66
0000 1111 0101 00
01
i=2 n=3
r=6
r/n = 2 >1.7
0110 H
2