Dictionary data structures
for the Inverted Index
This lecture
Dictionary data structures
Exact search
Prefix search
“Tolerant” retrieval
Edit-distance queries
Wild-card queries
Spelling correction
Soundex
Ch. 3
Basics
The dictionary data structure stores the term vocabulary, but… in what data
structure?
Sec. 3.1
A naïve dictionary
An array of struct:
char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes
How do we store a dictionary in memory efficiently?
How do we quickly look up elements at query time?
Sec. 3.1
Dictionary data structures
Two main choices:
Hash table
Tree
Trie
Some IR systems use hashes, some trees/tries
Sec. 3.1
Hashing with chaining
Key issue: a good hash function
Basic assumption: Uniform hashing
Avg #keys per slot = n * (1/m) = n/m = (load factor)
m = (n)
In practice
A trivial hash function is:
prime
Integer (if string?)
The current version is MurmurHash (vers 3) yields a 32-bit
or 128-bit hash value.
Another “provably good” hash
Each a i is selected at random in [0,m)
k 0 k 1 k 2 k r
≈log 2 m
r ≈ L / log 2 m
a 0 a 1 a 2 a r
Key
a
prime
L = bit len of string m = table size
not necessarily: (...mod p) mod m
Hashes
Each vocabulary term is hashed to an integer
Pros:
Lookup is faster than for a tree: O(1) on avg
Turn to worst-case by Cuckoo hashing
Cons:
Only exact search: e.g. judgment/judgement
If vocabulary keeps growing, you need to occasionally do the expensive rehashing
Cuckoo hash «limits» those rehashes
Sec. 3.1
Cuckoo Hashing
A
2 hash tables, and 2 random choices where an item can be
stored
A B C
E D
F
A running example
A B C F
E D
A running example
A B C F
E D
G
A running example
E G B C F
A D
A running example
We reverse the traversed edges keeps track of other position
Random (bipartite) graph: node=cell, edge=key
1 cycle, not a problem
A B
D F
G
If #cells > 2 * #keys, then low prob of cycling (but <50% space occupancy)
D G
B A
F
The double cycle (!!!)
E B C
A D F
G
D G C
E B F
A
D A C
E B F
G
D A F
E C G
B
If #cells > 2 * #keys, then low prob of 1 cycle
Low prob of 2 joined-cycles
Extensions for speed/space
More than 2 hashes (choices) per key.
Very different: hypergraphs instead of graphs.
Higher memory utilization
3 choices : 90%
4 choices : about 97%
2 hashes + bins of B-size.
Balanced allocation and tightly O(1)-size bins
Insertion sees a tree of possible evict+ins paths
but more insert time (and random access)
more memory
...but more local
Prefix search
Prefix-string Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.
Ex. Pattern P is pa
Dict = {abaco, box, paolo, patrizio, pippo,
zoo}
Trie: speeding-up searches
1
2 2
0
4
5
6
7
2 3
y
s
1 z
stile zyg
5 etic
ial ygy
aibelyite
czecin omo
Pro: O(p) search time = path scan
Cons: edge + node labels + tree structure
Tries
Do exist many variants and their implementations
Pros:
Solves the prefix problem
Cons:
Slower: O(p) time, many cache misses
From 10 to 60 (or, even more) bytes per node
Sec. 3.1
systile syzygetic syzigial syzygygy szaibelyite szczecin szomo systile szaibelyite
CT
on a sample
2-level indexing
Disk
Internal
Memory A disadvantage:
•
Trade-off ≈ speed vs space (because of bucket size)
2 advantages:
•
Search ≈ typically 1 I/O + in-mem comp.
•
Space ≈ trie built over a subset of strings
Front-coding: squeezing dict
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html ...
0 http://checkmate.com/All_Natural/
33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html 38 Salt.html
33 Cream.html
33 45%
0 http://checkmate.com/All/Natural/Washcloth.html...
systile syzygetic syzygial syzygy
2 5 5
Gzip may be much better...
….70systile 92zygeti c85ial 65y 110szaibelyite 82czecin92omo….
systile szaielyite CT
on a sample
2-level indexing
Disk
Internal
Memory A disadvantage:
•
Trade-off ≈ speed vs space (because of bucket size)
2 advantages:
•
Search ≈ typically 1 I/O + in-mem comp.
•
Space ≈ Trie built over a subset of strings
and front-coding over buckets
Spelling correction
Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting queries to retrieve “right” answers
Two main flavors:
Isolated word
Check each word on its own for misspelling
But what about: from form
Context-sensitive is more effective
Look at surrounding words
e.g., I flew form Heathrow to Narita.
Sec. 3.3
Isolated word correction
Fundamental premise – there is a lexicon from which the correct spellings come
Two basic choices for this
A standard lexicon such as
Webster’s English Dictionary
An “industry-specific” lexicon – hand-maintained
The lexicon of the indexed corpus
E.g., all words on the web
All names, acronyms etc. (including the mis-spellings)