Dictionary data structures for the Inverted Index

(1)

Dictionary data structures

for the Inverted Index

(2)

This lecture

 Dictionary data structures

 Exact search

 Prefix search

 “Tolerant” retrieval

 Edit-distance queries

 Wild-card queries

 Spelling correction

 Soundex

Ch. 3

(3)

Basics

 The dictionary data structure stores the term vocabulary, but… in what data

structure?

Sec. 3.1

(4)

A naïve dictionary

 An array of struct:

char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes

 How do we store a dictionary in memory efficiently?

 How do we quickly look up elements at query time?

Sec. 3.1

(5)

Dictionary data structures

 Two main choices:

 Hash table

 Tree

 Trie

 Some IR systems use hashes, some trees/tries

Sec. 3.1

(6)

Hashing with chaining

(7)

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n * (1/m) = n/m = (load factor)

m = (n)

(8)

In practice

A trivial hash function is:

prime

Integer (if string?)

The current version is MurmurHash (vers 3) yields a 32-bit

or 128-bit hash value.

(9)

Another “provably good” hash

 Each a _i is selected at random in [0,m)

k ₀ k ₁ k ₂ k _r

≈log ₂ m

r ≈ L / log ₂ m

a ₀ a ₁ a ₂ a _r

Key

a

prime

L = bit len of string m = table size

not necessarily: (...mod p) mod m

(10)

Hashes

 Each vocabulary term is hashed to an integer

 Pros:

 Lookup is faster than for a tree: O(1) on avg

 Turn to worst-case by Cuckoo hashing

 Cons:

 Only exact search: e.g. judgment/judgement

 If vocabulary keeps growing, you need to occasionally do the expensive rehashing

 Cuckoo hash «limits» those rehashes

Sec. 3.1

(11)

Cuckoo Hashing

A

2 hash tables, and 2 random choices where an item can be

stored

(12)

A B C

E D

F

A running example

(13)

A B C F

E D

A running example

(14)

A B C F

E D

G

A running example

(15)

E G B C F

A D

A running example

We reverse the traversed edges  keeps track of other position

Random (bipartite) graph: node=cell, edge=key

(16)

1 cycle, not a problem

A B

D ^F

G

If #cells > 2 * #keys, then low prob of cycling (but <50% space occupancy)

D G

B A

F

(17)

The double cycle (!!!)

E B C

A D F

G

D G C

E B F

A

D A C

E B F

G

D A F

E C G

B

If #cells > 2 * #keys, then low prob of 1 cycle

 Low prob of 2 joined-cycles

(18)

Extensions for speed/space

 More than 2 hashes (choices) per key.

 Very different: hypergraphs instead of graphs.

 Higher memory utilization



3 choices : 90%



4 choices : about 97%

 2 hashes + bins of B-size.

 Balanced allocation and tightly O(1)-size bins

 Insertion sees a tree of possible evict+ins paths

but more insert time (and random access)

more memory

...but more local

(19)

Prefix search

(20)

Prefix-string Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

Ex. Pattern P is pa

Dict = {abaco, box, paolo, patrizio, pippo,

zoo}

(21)

Trie: speeding-up searches

1 2 2

0

4

5

6

7 2 3

y

s

1 z

stile zyg

5 etic

ial ygy

aibelyite

czecin omo

Pro: O(p) search time = path scan

Cons: edge + node labels + tree structure

(22)

Tries

Do exist many variants and their implementations

 Pros:

 Solves the prefix problem

 Cons:

 Slower: O(p) time, many cache misses

 From 10 to 60 (or, even more) bytes per node

Sec. 3.1

(23)

systile syzygetic syzigial syzygygy szaibelyite szczecin szomo systile szaibelyite

CT

on a sample

2-level indexing

Disk

Internal

Memory A disadvantage:

•

Trade-off ≈ speed vs space (because of bucket size)

2 advantages:

•

Search ≈ typically 1 I/O + in-mem comp.

•

Space ≈ trie built over a subset of strings

(24)

Front-coding: squeezing dict

http://checkmate.com/All_Natural/

http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html

http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html ...

0 http://checkmate.com/All_Natural/

33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html

25 Essence_Oils.html

25 Mineral_Bath_Crystals.html 38 Salt.html

33 Cream.html

33  45%

0 http://checkmate.com/All/Natural/Washcloth.html...

systile syzygetic syzygial syzygy

2 5 5

Gzip may be much better...

(25)

….70systile 92zygeti c85ial 65y 110szaibelyite 82czecin92omo….

systile szaielyite CT

on a sample

2-level indexing

Disk

Internal

Memory A disadvantage:

•

Trade-off ≈ speed vs space (because of bucket size)

2 advantages:

•

Search ≈ typically 1 I/O + in-mem comp.

•

Space ≈ Trie built over a subset of strings

and front-coding over buckets

(26)

Spelling correction

(27)

Spell correction

 Two principal uses

 Correcting document(s) being indexed

 Correcting queries to retrieve “right” answers

 Two main flavors:

 Isolated word



Check each word on its own for misspelling



But what about: from  form

 Context-sensitive is more effective



Look at surrounding words



e.g., I flew form Heathrow to Narita.

Sec. 3.3

(28)

Isolated word correction

 Fundamental premise – there is a lexicon from which the correct spellings come

 Two basic choices for this

 A standard lexicon such as



Webster’s English Dictionary



An “industry-specific” lexicon – hand-maintained

 The lexicon of the indexed corpus



E.g., all words on the web



All names, acronyms etc. (including the mis-spellings)



Mining algorithms to derive the possible corrections

Sec. 3.3.2

(29)

Isolated word correction

 Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q

 What’s “closest”?

 We’ll study several measures

 Edit distance (Levenshtein distance)

 Weighted edit distance

 n-gram overlap

Sec. 3.3.2

(30)

Brute-force check of ED

Given query Q,

 enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)

 Intersect this set with list of “correct” words

 Show terms you found to user as suggestions

Sec. 3.3.4

How the time complexity grows

with #errors allowed and string

Length ?

(31)

Edit distance

 Given two strings S ₁ and S ₂ , the minimum

number of operations to convert one to the other

 Operations are typically character-level

 Insert, Delete, Replace (possibly, Transposition)

 E.g., the edit distance from dof to dog is 1

 From cat to act is 2 (Just 1 with transpose)

 from cat to dog is 3.

 Generally found by dynamic programming.

Sec. 3.3.3

(32)

 Let E(i,j) = edit distance to transform P _1,i in T _1,j

 Consider the sequence of ops: ins, del, subst, match

 Order the sequence of ops by position involved in T

 If P[i]=T[j] then last op is a match, and thus it is not counted

 Otherwise the last op is: subst(P[i],T[j]) or ins(T[j]) or del(P[i])

DynProg for Edit Distance

E(i,0)=i, E(0,j)=j

E(i, j) = E(i–1, j–1) if P[i]=T[j]

E(i, j) = min{E(i, j–1), E(i–1, j),

E(i–1, j–1)}+1 if P[i] T[j]

(33)

Example

T

0 1 2 3 4 5 6

p t t a p a

P

0 0 1 2 3 4 5 6

1 p 1 0 1 2 3 4 5

2 a 2 1 1 2 2 3 4

3 t 3 2 1 1 2 3 4

4 t 4 3 2 1 2 3 4

+1 +1

+1

(34)

Weighted edit distance

 As above, but the weight of an operation depends on the character(s) involved

 Meant to capture keyboard errors, e.g. m more likely to be mis-typed as n than as q

 Therefore, replacing m by n is a smaller cost than by q

 Requires weighted matrix as input

 Modify DP to handle weights

Sec. 3.3.3

Dictionary data structures for the Inverted Index

Dictionary data structures

for the Inverted Index

This lecture

 Dictionary data structures

 Exact search

 Prefix search

 “Tolerant” retrieval

 Edit-distance queries

 Wild-card queries

 Spelling correction

 Soundex

Ch. 3

Basics

 The dictionary data structure stores the term vocabulary, but… in what data

structure?

Sec. 3.1

A naïve dictionary

 An array of struct:

char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes

 How do we store a dictionary in memory efficiently?

 How do we quickly look up elements at query time?

Sec. 3.1

Dictionary data structures

 Two main choices:

 Hash table

 Tree

 Trie

 Some IR systems use hashes, some trees/tries

Sec. 3.1

Hashing with chaining

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n * (1/m) = n/m = (load factor)

m = (n)

In practice

A trivial hash function is:

prime

Integer (if string?)

The current version is MurmurHash (vers 3) yields a 32-bit

or 128-bit hash value.

Another “provably good” hash

 Each a i is selected at random in [0,m)

k 0 k 1 k 2 k r

≈log 2 m

r ≈ L / log 2 m

a 0 a 1 a 2 a r

Key

a

prime

L = bit len of string m = table size

not necessarily: (...mod p) mod m

Hashes

 Each vocabulary term is hashed to an integer

 Pros:

 Lookup is faster than for a tree: O(1) on avg

 Turn to worst-case by Cuckoo hashing

 Cons:

 Only exact search: e.g. judgment/judgement

 If vocabulary keeps growing, you need to occasionally do the expensive rehashing

 Cuckoo hash «limits» those rehashes

Sec. 3.1

Cuckoo Hashing

A

2 hash tables, and 2 random choices where an item can be

stored

A B C

E D

F

A running example

A B C F

E D

A running example

A B C F

E D

G

A running example

E G B C F

A D

A running example

 Each a _i is selected at random in [0,m)

k ₀ k ₁ k ₂ k _r

≈log ₂ m

r ≈ L / log ₂ m

a ₀ a ₁ a ₂ a _r

D ^F