Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.
Hashing
Hashing with chaining
Key issue: a good hash function
Basic assumption: Uniform hashing
Avg #keys per slot = n * (1/m) = n/m
= (load factor)
Search cost
m = (n)
In practice
A trivial hash function is:
prime
A “provably good” hash is
Each a i is selected at random in [0,m)
k 0 k 1 k 2 k r
≈log 2 m
r ≈ L / log 2 m
a 0 a 1 a 2 a r
K
a
prime
l = max string len m = table size
not necessarily: (...mod p) mod m
Cuckoo Hashing
A B C
E D
2 hash tables, and 2 random choices where an item can be
stored
A B C
E D
F
A running example
A B C F
E D
A running example
A B C F
E D
G
A running example
E G B C F
A D
A running example
Cuckoo Hashing Examples
A B C
E D F
G
Random (bipartite) graph: node=cell, edge=key
Natural Extensions
More than 2 hashes (choices) per key.
Very different: hypergraphs instead of graphs.
Higher memory utilization
3 choices : 90+% in experiments
4 choices : about 97%
2 hashes + bins of B-size.
Balanced allocation and tightly O(1)-size bins
Insertion sees a tree of possible evict+ins paths
but more insert time (and random access)
more memory
...but more local
Dictionary search
Making one-side errors
Paper on Bloom Filter
What does it mean to crawl ?
• What to crawl in the Web ?
• How deep to crawl a site ?
• How often to crawl a page ?
Crawling
How to keep track of the URLs visited by a crawler?
URLs are long
Check should be very fast
No care about small errors (≈ page not crawled)
Bloom Filter
over crawled URLs
Searching with errors...
Problem: false positives
TTT 2
Not perfectly true but...
m/n = 8
Opt k = 5.45...
We do have an
explicit formula
for the optimal k
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total length N,
store them in a way that we can efficiently support
prefix searches for a pattern P over them.
Trie: speeding-up searches
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
Pro: O(p) search time
Cons: edge + node labels and tree structure
Front-coding: squeezing strings
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html ...
0 http://checkmate.com/All_Natural/
33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html 38 Salt.html
33 Cream.html
33 45%
0 http://checkmate.com/All/Natural/Washcloth.html...
….systile syzygetic syzygial syzygy….
2 5 5
Gzip may be much better...
….70systile 92zygeti c85ial 65y 110szaibelyite 82czecin92omo….
systile szaielyite CT
on a sample
2-level indexing
Disk
Internal
Memory A disadvantage:
•
Trade-off ≈ speed vs space (because of bucket size)
2 advantages:
•
Search ≈ typically 1 I/O
•