Index Construction:
sorting
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading: Chap. 4
Indexer steps
Dictionary & postings: how do we construct them?
• Scan the texts
• Find the proper tokens-list
• Append the docID to the proper tokens-list
How do we:
• Find? … time issue …
• Append? … space issues …
• Postings’ size?
• Dictionary size? … in-memory issues …
Indexer steps: create token
Sequence of pairs: <Modified token, Document ID>
What about var-length strings?
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
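A minimal Python sketch (not from the slides) of this pair-creation step on the two example documents; the whitespace/punctuation tokenizer is a simplifying assumption:

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

pairs = []
for doc_id, text in docs.items():
    for token in text.lower().replace(";", " ").replace(".", " ").split():
        pairs.append((token, doc_id))   # one <modified token, docID> pair per occurrence

print(pairs[:3])   # [('i', 1), ('did', 1), ('enact', 1)]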
Indexer steps: Sort
Sort by Term; then, for equal terms, sort by DocID
Core indexing step
Indexer steps: Dictionary & Postings
• Multiple entries of a term in a single document are merged.
• Split into Dictionary and Postings.
• Document-frequency information is added (see the sketch below).
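Continuing the sketch above (again an illustrative simplification, not the slides' code): sort the pairs, merge duplicate (term, docID) entries, and split the result into dictionary and postings:

from itertools import groupby

pairs.sort()                                   # sort by (term, docID): the core step

dictionary = {}                                # term -> document frequency
postings = {}                                  # term -> sorted list of docIDs
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})   # merge duplicates
    dictionary[term] = len(doc_ids)            # doc. frequency
    postings[term] = doc_ids

print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]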
Some key issues…
[Figure: dictionary of terms with counts, plus pointers to the lists of docIDs (the postings)]
Now:
• How do we sort?
• How much storage is needed?
Pay attention to the disk…
If sorting needs to manage strings, key observations:
• Array A is an “array of pointers to objects”: you sort A, not the strings
• Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the 2 memory locations pointed to by A[i] and A[j]
• Θ(n log n) random memory accesses (I/Os??)
[Figure: array A of pointers into the memory area containing the strings]
Again, caching helps, but how much?
Binary Merge-Sort

Merge-Sort(A, i, j)
  if (i < j) then
    m = (i + j) / 2          // Divide
    Merge-Sort(A, i, m)      // Conquer
    Merge-Sort(A, m + 1, j)  // Conquer
    Merge(A, i, m, j)        // Combine

[Figure: merging two sorted runs, e.g. 1 2 8 10 and 7 9 13 19 → 1 2 7 …]
Merge is linear in the #items to be merged
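For concreteness, a small Python version of the Merge step (a sketch, equivalent to the pseudocode above): each item is copied exactly once, so merging is linear in the number of items merged:

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])   # one run is exhausted
    return out

print(merge([1, 2, 8, 10], [7, 9, 13, 19]))   # [1, 2, 7, 8, 9, 10, 13, 19]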
Few key observations
• Items = (short) strings = atomic…
• On English Wikipedia, about 10^9 tokens to sort
• Θ(n log n) memory accesses (I/Os??)
• [5 ms] × n log2 n ≈ 3 years
In practice it is “faster”: why?
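A back-of-envelope check of that estimate (assuming 5 ms per random disk access; the exact constant only shifts the result within the same order of magnitude):

from math import log2

n = 10**9                        # tokens
access = 0.005                   # 5 ms per random access, in seconds
seconds = access * n * log2(n)
print(seconds / (365 * 24 * 3600), "years")   # ≈ 4.7: years of work, as claimed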
SPIMI:
Single-pass in-memory indexing
Key idea #1: Generate separate dictionaries for each block (no need for a term → termID mapping)
Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur (in internal memory).
Generate an inverted index for each block:
• More space for postings available
• Compression is possible
Merge the indexes into one big index:
• Easy append with 1 file per postings list (docIDs are increasing within a block); otherwise a multi-way merge, as in multi-way merge-sort
[Algorithm: SPIMI-Invert]
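A simplified Python sketch of the SPIMI-Invert loop described above (after the reading's Chap. 4; the memory test and the helper write_block_to_disk are stand-ins for the real mechanisms):

def spimi_invert(token_stream, max_postings=1_000_000):
    block_no, index, n_postings = 0, {}, 0     # index: term -> list of docIDs
    for term, doc_id in token_stream:          # no termIDs, no sorting of pairs
        plist = index.setdefault(term, [])
        if not plist or plist[-1] != doc_id:   # docIDs are increasing in a block
            plist.append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:         # "memory" exhausted: flush a block
            write_block_to_disk(sorted(index.items()), block_no)
            index, n_postings = {}, 0
            block_no += 1
    if index:
        write_block_to_disk(sorted(index.items()), block_no)

def write_block_to_disk(sorted_terms, block_no):   # hypothetical helper
    with open("block_%d.txt" % block_no, "w") as f:
        for term, docs in sorted_terms:
            f.write(term + " " + " ".join(map(str, docs)) + "\n")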
Use the same algorithm for disk? Multi-way merge-sort, aka BSBI: Blocked Sort-Based Indexing.
The mapping term → termID must be kept in memory to construct the pairs, so it consumes memory during pair creation.
It needs two passes, unless you use hashing (and thus accept some probability of collision).
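A tiny sketch of such a term → termID mapping (illustrative only; hashing the term directly would avoid keeping this table, at the price of possible collisions):

term_to_id = {}
def term_id(term):
    if term not in term_to_id:
        term_to_id[term] = len(term_to_id)   # assign the next free termID
    return term_to_id[term]

print(term_id("caesar"), term_id("brutus"), term_id("caesar"))   # 0 1 0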
Recursion
[Figure: top-down recursion tree of merge-sort on 10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11, split down to pairs; depth = log2 N]
Implicit Caching…
[Figure: bottom-up merge passes on 10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11 — sorted pairs (2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11), then runs of 4 (1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17), then runs of 8 (1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17), finally the fully sorted sequence; log2 N passes]
• N/M runs, each sorted in internal memory (no I/Os)
• Each merge level makes 2 passes (one read, one write) = 2 (N/B) I/Os
• I/O-cost of binary merge-sort is therefore ≈ 2 (N/B) log2 (N/M)
A key inefficiency
[Figure: binary merge of two runs, 1 2 4 7 9 10 13 19 and 3 5 6 8 11 12 15 17, streamed through B-sized pages and an output buffer to disk: output run 1, 2, 3, then 4, …]
After a few steps, every run is longer than B!!!
We are using only 3 pages, but memory contains M/B pages ≈ 2^30 / 2^15 = 2^15
Multi-way Merge-Sort
Sort N items with main memory M and disk pages of size B:
• Pass 1: produce N/M sorted runs.
• Pass i: merge X = M/B − 1 runs at a time ⇒ logX (N/M) merge passes (see the sketch below)
[Figure: main-memory buffers of B items — one page per run (run 1, run 2, …, run X) plus one output page, streaming between disk and disk]
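In Python, the X-way merge at the heart of this scheme can be sketched with a heap that repeatedly extracts the smallest head among the X runs (here heapq.merge does the page-by-page work for us):

import heapq

runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(list(heapq.merge(*runs)))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19]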
How it works
[Figure: X-way merging of the sorted runs, e.g. 1 2 5 10 and 7 9 13 19 merged into 1 2 5 7 …, down to the fully sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19]
• N/M runs, each sorted in internal memory = 2 (N/B) I/Os
• Each level makes 2 passes (one read, one write) = 2 (N/B) I/Os
• I/O-cost of X-way merge-sort is ≈ 2 (N/B) I/Os per level × logX (N/M) levels
Cost of Multi-way Merge-Sort
• Number of passes = logX (N/M) = logM/B (N/M)
• Total I/O-cost is Θ( (N/B) logM/B (N/M) ) I/Os
• Large fan-out (M/B) decreases the #passes
In practice M/B ≈ 10^5 ⇒ #passes = 1 ⇒ a few minutes
Tuning depends on disk features
Compression would decrease the cost of a pass!
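Plugging in illustrative values (the choice of N, M, B below is an assumption, not from the slides):

from math import ceil, log

N, M, B = 10**10, 2**30, 2**15           # items to sort, memory size, page size
passes = ceil(log(N / M, M / B))         # log_{M/B}(N/M) merge passes
io_cost = 2 * (N / B) * (1 + passes)     # run formation + merge passes, R/W each
print(passes, int(io_cost))              # 1 merge pass, ~1.2 million page I/Os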
Distributed indexing
For web-scale indexing: must use a distributed computing cluster of PCs
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?
Distributed indexing
Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of (parallel) tasks.
Master machine assigns tasks to idle machines
Other machines can play many roles during the computation
Parallel tasks
We will use two sets of parallel tasks
Parsers and Inverters
Break the document collection in two ways:
• Term-based partition: one machine handles a subrange of terms
• Doc-based partition: one machine handles a subrange of documents
Data flow: doc-based partitioning
[Figure: the Master assigns document splits to Parsers; each Parser feeds an Inverter, which builds a complete index (IL1, IL2, …, ILk) for its own document subset (Set1, Set2, …, Setk)]
Each query-term goes to many machines
Data flow: term-based partitioning
[Figure: the Master assigns document splits to Parsers; each Parser emits (term, doc) pairs partitioned into term ranges (a-f, g-p, q-z); each Inverter collects all pairs of one term range (from Set1, Set2, …, Setk) and writes the postings for that range]
Each query-term goes to one machine
MapReduce
This is a robust and conceptually simple framework for distributed computing… without having to write code for the distribution part.
The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
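Index construction in MapReduce terms, in miniature (a sketch, certainly not Google's code): map emits (term, docID) pairs, the framework groups them by term, and reduce turns each group into a postings list:

from itertools import groupby

def map_fn(doc_id, text):
    for token in text.lower().split():         # Map: emit (term, docID) pairs
        yield (token, doc_id)

def reduce_fn(term, doc_ids):
    return (term, sorted(set(doc_ids)))        # Reduce: build the postings list

docs = [(1, "caesar was killed"), (2, "caesar was ambitious")]
pairs = sorted(kv for d, t in docs for kv in map_fn(d, t))    # shuffle & sort
index = [reduce_fn(term, [d for _, d in grp])
         for term, grp in groupby(pairs, key=lambda kv: kv[0])]
print(index)   # [('ambitious', [2]), ('caesar', [1, 2]), ('killed', [1]), ('was', [1, 2])]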
Data flow: term-based partitioning
[Figure: the same term-based partitioning pipeline, viewed as MapReduce — Map phase: Parsers read 16MB–64MB splits and write segment files on local disks; Reduce phase: Inverters build the postings for the term ranges a-f, g-p, q-z. Each term range must be guaranteed to fit in one machine]
Dynamic indexing
Up to now, we have assumed static collections.
It now happens more and more frequently that:
• Documents come in over time
• Documents are deleted and modified
And this induces:
• Postings updates for terms already in the dictionary
• New terms added to / deleted from the dictionary
Simplest approach
• Maintain a “big” main index
• New docs go into a “small” auxiliary index
• Search across both, and merge the results
Deletions:
• Invalidation bit-vector for deleted docs
• Filter search results (i.e., docs) by the invalidation bit-vector
Periodically, re-index into one main index
Issues with 2 indexes
Poor performance:
• Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
• The merge is then the same as a simple append [new docIDs are greater] (see the sketch below).
• But this needs a lot of files – inefficient for the O/S.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
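The append-only merge of one postings list, sketched (it relies on the docIDs in the auxiliary index all being greater than those in the main index):

def merge_postings(main, aux):
    assert not main or not aux or main[-1] < aux[0]   # new docIDs are greater
    return main + aux                                  # a simple append

print(merge_postings([1, 4, 9], [12, 15]))   # [1, 4, 9, 12, 15]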
Logarithmic merge
• Maintain a series of indexes, each twice as large as the previous one: 2^0 M, 2^1 M, 2^2 M, 2^3 M, …
• Keep a small index (Z) in memory (of size 2^0 M = M)
• Store I0, I1, … on disk (sizes 2^0 M, 2^1 M, …)
• If Z gets too big (> M), write it to disk as I0, or merge it with I0 (if I0 already exists)
• Either write the merged Z to disk as I1 (if there is no I1) or merge it with I1 to form a larger Z
• etc. (see the sketch below)
⇒ # indexes = logarithmic
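# A toy sketch of the cascade (indexes here are plain sorted lists of postings; a real system would merge postings files on disk):

M = 4                                     # toy in-memory budget

def logarithmic_merge(indexes, Z):
    # indexes[i] holds a sorted index of size 2^i * M, or None if the slot is free
    i = 0
    while True:
        if i == len(indexes):
            indexes.append(None)          # open a new level
        if indexes[i] is None:
            indexes[i] = Z                # free slot: write Z to disk as I_i
            return
        Z = sorted(indexes[i] + Z)        # merge with I_i: the size doubles
        indexes[i] = None
        i += 1

disk = []
logarithmic_merge(disk, [3, 7, 9, 20])    # stored as I0
logarithmic_merge(disk, [1, 5, 8, 11])    # merges with I0, stored as I1
print(disk)   # [None, [1, 3, 5, 7, 8, 9, 11, 20]]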
Some analysis (T = #postings = #tokens)
• Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge. (This would be O(n^2) if we count O(1) the cost of processing one document.)
• Logarithmic merge: each posting is merged O(log (T/M)) times, so the total time cost is O(T log (T/M)). (This would be O(n log (n/M)) if we consider O(1) the cost of processing one document.)
• Logarithmic merge is more efficient for index construction, but its query processing involves O(log (T/M)) indexes.
Web search engines
Most search engines now support dynamic indexing:
• News items, blogs, new topical web pages
But (sometimes/typically) they also periodically reconstruct the index:
• Query processing is then switched to the new index, and the old index is deleted