Dictionary data structures for the Inverted Index

Academic year: 2021


(1)

Dictionary data structures for the Inverted Index

(2)

This lecture

 Dictionary data structures

 Exact search

 Prefix search

 “Tolerant” retrieval

 Edit-distance queries

 Wild-card queries

 Spelling correction

 Soundex

Ch. 3

(3)

Basics

 The dictionary data structure stores the term vocabulary, but… in what data structure?

Sec. 3.1

(4)

A naïve dictionary

 An array of structs, one per term:

  char[20]   int        Postings*
  20 bytes   4/8 bytes  4/8 bytes

(the fixed-width term, its frequency, and a pointer to its postings list)

 How do we store a dictionary in memory efficiently?

 How do we quickly look up elements at query time?

Sec. 3.1

(5)

Dictionary data structures

 Two main choices:

 Hash table

 Tree

 Trie

 Some IR systems use hashes, some trees/tries

Sec. 3.1

(6)

Hashing with chaining

(7)

Key issue: a good hash function

Basic assumption: uniform hashing

Avg #keys per slot = n · (1/m) = n/m = α (the load factor)

m = Θ(n)

(8)

In practice

A trivial hash function is h(k) = k mod p, with p a prime (string keys must first be mapped to integers).

A current, widely used choice is MurmurHash (version 3), which yields a 32-bit or 128-bit hash value.

(9)

Another “provably good” hash

The key (a string of L bits) is split into r ≈ L / log₂ m chunks k₀ k₁ k₂ … kᵣ of ≈ log₂ m bits each; the coefficients a₀ a₁ a₂ … aᵣ are each selected at random in [0, m):

h(key) = (Σᵢ aᵢ · kᵢ) mod m

where L = bit length of the string, m = table size, and p a prime (not necessarily computed as (… mod p) mod m).

(10)

Hashes

 Each vocabulary term is hashed to an integer

 Pros:

 Lookup is faster than for a tree: O(1) on average (turned into worst-case O(1) by Cuckoo hashing)

 Cons:

 Only exact search: e.g. judgment/judgement

 If vocabulary keeps growing, you need to occasionally do the expensive rehashing

 Cuckoo hashing “limits” those rehashes

Sec. 3.1

(11)

Cuckoo Hashing

2 hash tables, and 2 random choices where an item can be stored

(12)

A running example

[Figure: keys A–G are inserted one by one into the two tables; when both candidate cells of a new key are occupied, the resident key is evicted and moved to its alternative cell, possibly triggering further evictions.]

We reverse the traversed edges  this keeps track of the other position of each key

Random (bipartite) graph: node = cell, edge = key

(16)

1 cycle, not a problem

[Figure: keys A, B, D, F, G form a single cycle; walking the cycle and flipping each key to its other cell still places all of them.]

If #cells > 2 · #keys, then low probability of cycling (but < 50% space occupancy)

(17)

The double cycle (!!!)

[Figure: keys A–G form two joined cycles; the eviction sequence keeps cascading and never terminates, so a rehash is required.]

If #cells > 2 · #keys, then low probability of 1 cycle

 Low probability of 2 joined cycles
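A minimal sketch of cuckoo insertion in Python (class names, the eviction budget, and the rehash policy are illustrative assumptions, not from the slides): each key has one candidate cell per table, insertion evicts the resident key to its other table, and a budget caps the eviction chain so the double-cycle case falls back to a rehash.

```python
import random

class CuckooHash:
    """Cuckoo hashing sketch: 2 tables, 2 hash functions per key."""
    def __init__(self, m=16):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self.seeds = [random.random(), random.random()]

    def _h(self, i, key):
        return hash((self.seeds[i], key)) % self.m

    def lookup(self, key):
        # worst-case O(1): exactly two cells to probe
        return any(self.tables[i][self._h(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return
        i = 0
        for _ in range(2 * self.m):        # eviction budget (assumed policy)
            pos = self._h(i, key)
            if self.tables[i][pos] is None:
                self.tables[i][pos] = key
                return
            # cell occupied: evict the resident, move it to the other table
            key, self.tables[i][pos] = self.tables[i][pos], key
            i = 1 - i
        self._rehash(key)                  # evictions cascaded too long

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.m *= 2
        self.tables = [[None] * self.m, [None] * self.m]
        self.seeds = [random.random(), random.random()]
        for k in keys:
            self.insert(k)
```

The eviction loop is exactly the edge-reversal walk on the bipartite graph: each swap moves a key along its edge to its other endpoint.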

(18)

Extensions for speed/space

 More than 2 hashes (choices) per key.

 Very different analysis: hypergraphs instead of graphs.

 Higher memory utilization: 3 choices  about 90%; 4 choices  about 97%

…but more insert time (and random accesses)

 2 hashes + bins of size B.

 Balanced allocation and tightly O(1)-size bins

 Insertion sees a tree of possible evict+insert paths

…more memory, but accesses are more local

(19)

Prefix search

(20)

Prefix-string Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

Ex. Pattern P = pa

Dict = {abaco, box, paolo, patrizio, pippo, zoo}  answers: paolo, patrizio
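Before turning to tries, note that on a sorted array of strings the prefix query reduces to two binary searches, since all strings with prefix P are contiguous. A minimal sketch (the sentinel trick for the upper bound is an assumption of this sketch, not from the slides):

```python
from bisect import bisect_left, bisect_right

def prefix_search(sorted_dict, p):
    """All strings with prefix p, via two binary searches."""
    lo = bisect_left(sorted_dict, p)
    # sentinel: a character greater than anything expected in the strings
    hi = bisect_right(sorted_dict, p + "\uffff")
    return sorted_dict[lo:hi]

d = sorted(["abaco", "box", "paolo", "patrizio", "pippo", "zoo"])
```

On the example dictionary, `prefix_search(d, "pa")` returns `["paolo", "patrizio"]`; each query costs O(log K) string comparisons plus the output size.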

(21)

Trie: speeding-up searches

[Figure: a compacted trie over the dictionary; nodes branch on characters such as s, y, z, and edges carry substrings such as stile, zyg, etic, ial, ygy, aibelyite, czecin, omo.]

Pro: O(p) search time = path scan

Cons: edge + node labels + tree structure (space overhead)

(22)

Tries

Many variants and implementations exist

 Pros:

 Solves the prefix problem

 Cons:

 Slower: O(p) time, many cache misses

 From 10 to 60 (or, even more) bytes per node

Sec. 3.1
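A minimal (non-compacted) trie sketch in Python, supporting the prefix search above (class and method names are illustrative): search walks an O(p) root-to-node path, then enumerates the subtree.

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    """Plain trie: insert in O(|s|), prefix search in O(p + output size)."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for c in s:
            node = node.children.setdefault(c, TrieNode())
        node.is_word = True

    def with_prefix(self, p):
        node = self.root
        for c in p:                      # O(p) path scan
            if c not in node.children:
                return []
            node = node.children[c]
        out = []
        def dfs(n, acc):                 # collect the subtree below p
            if n.is_word:
                out.append(p + acc)
            for c, child in sorted(n.children.items()):
                dfs(child, acc + c)
        dfs(node, "")
        return out
```

One dict per node is where the "10 to 60 bytes per node" cost shows up; compacted variants merge unary paths to save space.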

(23)

[Figure: 2-level indexing. A compacted trie (CT), built on a sample of the strings (e.g. systile, szaibelyite), lives in internal memory; the full sorted dictionary (systile syzygetic syzygial syzygy szaibelyite szczecin szomo …) is stored in buckets on disk, and the trie routes a query to the right bucket.]

2 advantages:

Search ≈ typically 1 I/O + in-memory comparisons

Space ≈ trie built over a subset of the strings

A disadvantage:

Trade-off ≈ speed vs space (because of the bucket size)

(24)

Front-coding: squeezing dict

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded (each line: shared-prefix length, then the remaining suffix):

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html ...

 45%

Gzip may be much better...

On the strings systile syzygetic syzygial syzygy, the shared-prefix lengths are 2 5 5.
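A minimal sketch of front coding as (shared-prefix length, suffix) pairs, with its inverse (function names are illustrative):

```python
def front_encode(sorted_strings):
    """Front coding: each string as (shared-prefix length, suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        l = 0
        while l < min(len(prev), len(s)) and prev[l] == s[l]:
            l += 1                       # length of the shared prefix
        out.append((l, s[l:]))
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the strings by reusing l chars of the previous one."""
    out, prev = [], ""
    for l, suffix in encoded:
        prev = prev[:l] + suffix
        out.append(prev)
    return out
```

Decoding a bucket is a sequential scan, which is why the 2-level scheme restarts the coding (length 0) at each bucket boundary: a bucket can be decoded independently after one I/O.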

(25)

[Figure: the same 2-level scheme with front-coded buckets on disk: ….70systile 92zygetic 85ial 65y 110szaibelyite 82czecin 92omo…. A compacted trie (CT) on a sample (systile, szaibelyite) sits in internal memory.]

2 advantages:

Search ≈ typically 1 I/O + in-memory comparisons

Space ≈ trie built over a subset of the strings, and front-coding over the buckets

A disadvantage:

Trade-off ≈ speed vs space (because of the bucket size)

(26)

Spelling correction

(27)

Spell correction

 Two principal uses

 Correcting document(s) being indexed

 Correcting queries to retrieve “right” answers

 Two main flavors:

 Isolated word

Check each word on its own for misspelling

But what about pairs like: from / form (each is a correctly spelled word)?

 Context-sensitive is more effective

Look at surrounding words

e.g., I flew form Heathrow to Narita.

Sec. 3.3

(28)

Isolated word correction

 Fundamental premise – there is a lexicon from which the correct spellings come

 Two basic choices for this

 A standard lexicon such as

Webster’s English Dictionary

An “industry-specific” lexicon – hand-maintained

 The lexicon of the indexed corpus

E.g., all words on the web

All names, acronyms etc. (including the mis-spellings)

Mining algorithms to derive the possible corrections

Sec. 3.3.2

(29)

Isolated word correction

 Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q

 What’s “closest”?

 We’ll study several measures

 Edit distance (Levenshtein distance)

 Weighted edit distance

n-gram overlap

Sec. 3.3.2

(30)

Brute-force check of ED

Given query Q,

 enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)

 Intersect this set with list of “correct” words

 Show terms you found to user as suggestions

Sec. 3.3.4

How does the time complexity grow with the number of errors allowed and the string length?
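The enumerate-and-intersect step can be sketched as follows for edit distance 1 (a minimal illustration with hypothetical function names; note the candidate set already grows as O(|Q| · |alphabet|), which is why larger distance bounds get expensive):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at edit distance 1 (insert / delete / replace)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts  = {a + c + b for a, b in splits for c in alphabet}
    return deletes | replaces | inserts

def suggestions(q, lexicon):
    """Intersect the candidate set with the list of 'correct' words."""
    return sorted(edits1(q) & lexicon)
```

For distance 2 one would apply `edits1` to every distance-1 candidate, squaring the candidate-set size.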

(31)

Edit distance

Given two strings S1 and S2, the minimum number of operations to convert one into the other

 Operations are typically character-level

 Insert, Delete, Replace (possibly, Transposition)

E.g., the edit distance from dof to dog is 1

From cat to act is 2 (Just 1 with transpose)

from cat to dog is 3.

Generally found by dynamic programming.

Sec. 3.3.3

(32)

DynProg for Edit Distance

Let E(i,j) = edit distance to transform P[1,i] into T[1,j]

 Consider the sequence of ops: ins, del, subst, match

 Order the sequence of ops by the position involved in T

If P[i] = T[j] then the last op is a match, and thus it is not counted

Otherwise the last op is: subst(P[i],T[j]) or ins(T[j]) or del(P[i])

E(i,0) = i,   E(0,j) = j

E(i,j) = E(i−1,j−1)   if P[i] = T[j]

E(i,j) = min{E(i,j−1), E(i−1,j), E(i−1,j−1)} + 1   if P[i] ≠ T[j]

(33)

Example: P = patt, T = pttapa

            T
        0   1   2   3   4   5   6
            p   t   t   a   p   a
P   0   0   1   2   3   4   5   6
1   p   1   0   1   2   3   4   5
2   a   2   1   1   2   2   3   4
3   t   3   2   1   1   2   3   4
4   t   4   3   2   1   2   3   4

The edit distance is E(4,6) = 4 (bottom-right cell).
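The recurrence of the previous slide translates directly into an O(|P| · |T|) table fill (a minimal sketch; 0-based Python indexing stands in for the 1-based P[i], T[j]):

```python
def edit_distance(P, T):
    """Dynamic-programming edit distance (insert, delete, replace)."""
    m, n = len(P), len(T)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i                      # delete all of P[1..i]
    for j in range(n + 1):
        E[0][j] = j                      # insert all of T[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if P[i-1] == T[j-1]:
                E[i][j] = E[i-1][j-1]    # match: not counted
            else:
                E[i][j] = 1 + min(E[i][j-1],    # ins T[j]
                                  E[i-1][j],    # del P[i]
                                  E[i-1][j-1])  # subst(P[i], T[j])
    return E[m][n]
```

On the slide's example, `edit_distance("patt", "pttapa")` yields 4, the bottom-right cell of the table.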

(34)

Weighted edit distance

 As above, but the weight of an operation depends on the character(s) involved

Meant to capture keyboard errors, e.g. m more likely to be mis-typed as n than as q

Therefore, replacing m by n is a smaller cost than by q

 Requires a weight matrix as input

 Modify the DP to handle weights

Sec. 3.3.3
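The DP modification is small: replace the uniform +1 by per-operation costs, with the substitution cost looked up from the character pair (a minimal sketch; the cost function and its values are illustrative assumptions, e.g. keyboard-adjacent letters made cheaper):

```python
def weighted_edit_distance(P, T, sub_cost, ins_cost=1, del_cost=1):
    """Edit distance where the substitution cost depends on the
    characters involved, e.g. sub_cost('m','n') < sub_cost('m','q')."""
    m, n = len(P), len(T)
    E = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = E[i-1][0] + del_cost
    for j in range(1, n + 1):
        E[0][j] = E[0][j-1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if P[i-1] == T[j-1] else sub_cost(P[i-1], T[j-1])
            E[i][j] = min(E[i][j-1] + ins_cost,     # ins T[j]
                          E[i-1][j] + del_cost,     # del P[i]
                          E[i-1][j-1] + s)          # subst or match
    return E[m][n]
```

With a constant cost function of 1 this reduces to the unweighted recurrence above.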
