Information Retrieval - University of Pisa
One index to store them all
Marco Cornolti
Lucene: features
An open-source, customizable search engine library that does:
1. Indexing of a document collection
2. Search in a collection
...and many other things!
It has bindings for almost any language (e.g. PHP, Python).
Lucene: indexing
document 1, document 2, document 3, ..., document 1000
(text of each document)
        |
        v
internal representation: list of tokens
doc1: good, bad, ugly, spaghetti, western, movie
doc2: matrix, neo, morpheus, trinity
doc3: blues, brother, dan, aykroyd, john, belushi
...
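The document-to-token-list step can be sketched in a few lines of plain Python (an illustration of the idea, not Lucene's analyzer code):

```python
import re

def simple_tokenize(text):
    # Split on non-letter characters and lowercase, roughly what a
    # SimpleAnalyzer-style tokenizer does.
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

docs = {
    1: "Good, bad, ugly: spaghetti western movie",
    2: "Matrix: Neo, Morpheus, Trinity",
}
# Internal representation: doc id -> list of tokens.
index = {doc_id: simple_tokenize(text) for doc_id, text in docs.items()}
print(index[1])  # -> ['good', 'bad', 'ugly', 'spaghetti', 'western', 'movie']
```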
Lucene: search
user query: U.S.A. Alien invasion
        |
        v
internal representation of the query: list of tokens
usa, alien, invasion
        |
        v
search algorithm
        |
        v
document ordering according to relevance: document 3, document 1, document 2
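The pipeline above can be sketched in plain Python: build an inverted index from token lists, then order documents by how many query tokens they contain. This is a toy illustration of the idea, not Lucene's actual data structures or scoring.

```python
import re

def tokenize(text):
    # Split on non-letters and lowercase (SimpleAnalyzer-style).
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

docs = {
    1: "good bad ugly spaghetti western movie",
    2: "matrix neo morpheus trinity",
    3: "blues brothers dan aykroyd john belushi",
}

# Inverted index: token -> set of doc ids containing that token.
inverted = {}
for doc_id, text in docs.items():
    for token in tokenize(text):
        inverted.setdefault(token, set()).add(doc_id)

def search(query):
    # Score each document by the number of query tokens it contains,
    # then order by decreasing score (a crude stand-in for relevance).
    scores = {}
    for token in tokenize(query):
        for doc_id in inverted.get(token, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("western movie"))  # -> [1]
```

Real ranking functions weigh token frequency and rarity; the point here is only the index/search split.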
Lucene: indexing (2)
Internal representation of documents depends on the analyzer we use to parse them.
Examples:
● KeywordAnalyzer: whole input in one single token (no tokenization)
● WhitespaceAnalyzer: split on spaces
● SimpleAnalyzer: split on non-letters + lowercase
● ClassicAnalyzer: splits according to a set of rules (basically, non-letters) + lowercase + stopword removal + normalization
Analyzer examples
Input: Why did Obama, president of U.S.A., spoke at NATO summit?

WhitespaceAnalyzer:
Why | did | Obama, | president | of | U.S.A., | spoke | at | NATO | summit?

SimpleAnalyzer:
why | did | obama | president | of | u | s | a | spoke | at | nato | summit

ClassicAnalyzer:
why | obama | president | usa | spoke | nato | summit
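The three outputs can be approximated in plain Python. These regex sketches only mimic the analyzers' behavior on this example (the stopword list here is a tiny illustrative one, not Lucene's):

```python
import re

sentence = "Why did Obama, president of U.S.A., spoke at NATO summit?"

def whitespace_like(text):
    # WhitespaceAnalyzer: split on whitespace only; keep case and punctuation.
    return text.split()

def simple_like(text):
    # SimpleAnalyzer: split on non-letters and lowercase.
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

STOPWORDS = {'did', 'of', 'at', 'the'}  # tiny illustrative stopword list

def classic_like(text):
    # Rough approximation of ClassicAnalyzer: keep dotted acronyms such as
    # U.S.A. as one token, normalize to "usa", lowercase, drop stopwords.
    tokens = re.findall(r"[A-Za-z]+(?:\.[A-Za-z]+)*", text)
    return [t.replace('.', '').lower()
            for t in tokens if t.lower() not in STOPWORDS]
```

Running the three functions on `sentence` reproduces the token lists shown above.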
Aim: Indexing
Build a toy search engine
● Read from input (one document per line)
● Index with Lucene
● Search with Lucene
Environment
From terminal:
sudo apt-get install python-lucene
Use of IndexWriter
edit create_index.py:
import lucene
import sys
from lucene import *

lucene.initVM()  # initialize the JVM

dir_name = "test_index"
# directory on the file system where the index is stored
index_dir = SimpleFSDirectory(File(dir_name))
# create the analyzer
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)
# create a new index using that analyzer (True deletes any pre-existing data)
writer = IndexWriter(index_dir, analyzer, True,
                     IndexWriter.MaxFieldLength.UNLIMITED)
for l in sys.stdin:
    doc = Document()
    # store the whole text (in its original form) in the index
    text_field = Field("text", l.strip(), Field.Store.YES, Field.Index.ANALYZED)
    doc.add(text_field)
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())
writer.optimize()
writer.close()
What happens in the file system?
$ ls -ltr test_index/
total 80
-rw-rw-r-- 1 marco marco  44 mar 11 23:43 _0.fdx
-rw-rw-r-- 1 marco marco  67 mar 11 23:43 _0.fdt
-rw-rw-r-- 1 marco marco  53 mar 11 23:43 _0.tis
-rw-rw-r-- 1 marco marco  35 mar 11 23:43 _0.tii
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.prx
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.nrm
-rw-rw-r-- 1 marco marco   8 mar 11 23:43 _0.frq
-rw-rw-r-- 1 marco marco  12 mar 11 23:43 _0.fnm
-rw-rw-r-- 1 marco marco  20 mar 11 23:43 segments.gen
-rw-rw-r-- 1 marco marco 262 mar 11 23:43 segments_1
Search with IndexReader
edit search_index.py:
import lucene
import sys
from lucene import *

lucene.initVM()

dir_name = "test_index"
# this part must be the same as in create_index.py
index_dir = SimpleFSDirectory(File(dir_name))
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

searcher = IndexSearcher(index_dir)
print "Insert a query:"
input_query = sys.stdin.readline().strip()
query = QueryParser(Version.LUCENE_35, "text", analyzer).parse(input_query)
MAX = 1000
hits = searcher.search(query, MAX)
print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print u"score:{0} doc_id:{1} text:{2}".format(hit.score, hit.doc, doc.get("text"))
Search with WhitespaceAnalyzer
Index three documents:
Obama u.s.a. NATO
obama usa N.A.T.O.
the cat meows
Search for:
cat
u.s.a.
OBAMA
N.A.T.O.
Run:
python create_index.py
(one document per line, terminate with double CTRL+D)
Then start:
python search_index.py
Replace WhitespaceAnalyzer with ClassicAnalyzer, recreate the index, and try the same queries.
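The outcome of the exercise can be predicted with the toy tokenizers from before: under whitespace splitting a query token must match the stored token exactly, while a ClassicAnalyzer-style normalization makes case and dotted acronyms irrelevant. This sketch only approximates the two analyzers; it is not Lucene's matching code.

```python
import re

def whitespace_tokens(text):
    # Exact whitespace tokens, case and punctuation preserved.
    return text.split()

def classic_like_tokens(text):
    # Approximate ClassicAnalyzer: collapse dotted acronyms, lowercase.
    tokens = re.findall(r"[A-Za-z]+(?:\.[A-Za-z]+)*", text)
    return [t.replace('.', '').lower() for t in tokens]

doc = "Obama u.s.a. NATO"

def matches(query, tokenizer):
    # A query matches when every query token appears in the document.
    doc_tokens = set(tokenizer(doc))
    return all(t in doc_tokens for t in tokenizer(query))

# Whitespace: only the literal token "u.s.a." matches this document.
print(matches("u.s.a.", whitespace_tokens))      # True
print(matches("OBAMA", whitespace_tokens))       # False
# Classic-like: "OBAMA" and "N.A.T.O." both normalize to stored tokens.
print(matches("OBAMA", classic_like_tokens))     # True
print(matches("N.A.T.O.", classic_like_tokens))  # True
```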