One index to store them all

Academic year: 2021

(1)

Information Retrieval - University of Pisa

One index to store them all

Marco Cornolti

(2)

Lucene: features

An open-source, customizable search engine library that provides:

1. Indexing of a document collection
2. Search in a collection

...and many other things!

Interfaces are available for many languages (e.g. PHP).

(3)

Lucene: indexing

Input: the text of each document (document 1, document 2, document 3, ..., document 1000)

Internal representation: a list of tokens for each document

doc1: good, bad, ugly, spaghetti, western, movie
doc2: matrix, neo, morpheus, trinity
doc3: blues, brother, dan, aykroyd, john, belushi
...
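The token lists above are what ends up in the index. As a minimal sketch (a toy in-memory structure, not Lucene's actual on-disk format), the same data can be arranged as an inverted index that maps each token to the documents containing it. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# Token lists taken from the slide; the doc ids are just labels.
docs = {
    "doc1": ["good", "bad", "ugly", "spaghetti", "western", "movie"],
    "doc2": ["matrix", "neo", "morpheus", "trinity"],
    "doc3": ["blues", "brother", "dan", "aykroyd", "john", "belushi"],
}


def build_inverted_index(token_lists):
    """Map each token to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in token_lists.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


inverted_index = build_inverted_index(docs)
print(inverted_index["neo"])            # ['doc2']
print(inverted_index.get("pizza", []))  # []
```

Looking up a token is then a single dictionary access, which is why search engines index by token rather than scanning every document.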

(4)

Lucene: search

User query: U.S.A. Alien invasion

Internal representation of the query (list of tokens): usa, alien, invasion

The search algorithm matches the query tokens against the index.

Result: documents ordered according to relevance (e.g. document 3, document 1, document 2)
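The flow above (analyze the query, match it against the documents, order by relevance) can be sketched in a few lines. This is only an illustration: the score used here is the number of distinct query tokens found in a document, which is a deliberate simplification and not Lucene's scoring formula. The documents and the helper names `analyze` and `search` are invented for this example.

```python
def analyze(text):
    """Toy analyzer: drop dots, lowercase, split on whitespace."""
    return text.replace(".", "").lower().split()


def search(query, token_lists):
    """Return (doc_id, score) pairs, best first.

    The score is the number of distinct query tokens present in the document.
    Documents that match no token are left out.
    """
    query_tokens = set(analyze(query))
    scored = []
    for doc_id, tokens in token_lists.items():
        score = len(query_tokens & set(tokens))
        if score > 0:
            scored.append((doc_id, score))
    # Sort by descending score, then by doc id so that ties are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical token lists, invented for this example.
docs = {
    "document 1": ["alien", "invasion", "movie"],
    "document 2": ["usa", "election"],
    "document 3": ["usa", "alien", "invasion", "report"],
    "document 4": ["cat", "meows"],
}

print(analyze("U.S.A. Alien invasion"))
# ['usa', 'alien', 'invasion']
print(search("U.S.A. Alien invasion", docs))
# [('document 3', 3), ('document 1', 2), ('document 2', 1)]
```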

(5)

Lucene: indexing (2)

Internal representation of documents depends on the analyzer we use to parse them.

Examples:

● KeywordAnalyzer: whole input in one single token (no tokenization)

● WhitespaceAnalyzer: split on spaces

● SimpleAnalyzer: split on non-letters + lowercase

● ClassicAnalyzer: splits according to a set of rules (basically, non-letters) + lowercase + stopword removal + normalization

(6)

Analyzer examples

Input: Why did Obama, president of U.S.A., spoke at NATO summit?

WhitespaceAnalyzer: Why | did | Obama, | president | of | U.S.A., | spoke | at | NATO | summit?

SimpleAnalyzer: why | did | obama | president | of | u | s | a | spoke | at | nato | summit

ClassicAnalyzer: why | obama | president | usa | spoke | nato | summit
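To make the differences concrete, here is a rough pure-Python approximation of the three analyzers. It does not call Lucene and only imitates the behaviour described on the slides: the regular expressions handle ASCII letters only, the acronym rule is a simplification, and the stopword set is an assumption chosen to reproduce the slide's output rather than Lucene's built-in list.

```python
import re

# Assumption: a tiny stopword set picked to reproduce the slide's example.
STOPWORDS = {"did", "of", "at"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not an ASCII letter, then lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def classic_like_analyzer(text):
    """Keep acronyms such as U.S.A. together, strip their dots,
    lowercase everything and drop stopwords."""
    raw_tokens = re.findall(r"(?:[A-Za-z]\.)+|[A-Za-z0-9]+", text)
    tokens = [t.replace(".", "").lower() for t in raw_tokens]
    return [t for t in tokens if t not in STOPWORDS]


sentence = "Why did Obama, president of U.S.A., spoke at NATO summit?"
print(" | ".join(whitespace_analyzer(sentence)))
print(" | ".join(simple_analyzer(sentence)))
print(" | ".join(classic_like_analyzer(sentence)))
```

Run on the example sentence, the three functions print the same token sequences as the slide.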

(7)

Aim: Indexing

Build a toy search engine

● Read from input (one document per line)

● Index with Lucene

● Search with Lucene

(8)

Environment

From terminal:

sudo apt-get install python-lucene

(9)

Use of IndexWriter

Edit create_index.py:

import sys
import lucene
from lucene import *

# Initialize the JVM
lucene.initVM()

# Directory on the file system where the index is stored
dir_name = "test_index"
index_dir = SimpleFSDirectory(File(dir_name))

# Create the analyzer
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

# Create a new index (True deletes any pre-existing data) that uses the analyzer
writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength.UNLIMITED)

# One document per input line
for l in sys.stdin:
    doc = Document()
    # Field.Store.YES writes the whole text (in original form) in the index
    text_field = Field("text", l.strip(), Field.Store.YES, Field.Index.ANALYZED)
    doc.add(text_field)
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())

writer.optimize()
writer.close()

(10)

What happens in the file system?

$ ls -ltr test_index/
total 80
-rw-rw-r-- 1 marco marco  44 mar 11 23:43 _0.fdx
-rw-rw-r-- 1 marco marco  67 mar 11 23:43 _0.fdt
-rw-rw-r-- 1 marco marco  53 mar 11 23:43 _0.tis
-rw-rw-r-- 1 marco marco  35 mar 11 23:43 _0.tii
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.prx
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.nrm
-rw-rw-r-- 1 marco marco   8 mar 11 23:43 _0.frq
-rw-rw-r-- 1 marco marco  12 mar 11 23:43 _0.fnm
-rw-rw-r-- 1 marco marco  20 mar 11 23:43 segments.gen
-rw-rw-r-- 1 marco marco 262 mar 11 23:43 segments_1
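Each file in the listing holds one part of the index. The sketch below lists a directory and annotates every file with the role its extension has in the Lucene 3.x file format. The helper `describe_index_files` is hypothetical, and the directory name `test_index` is the one used in the scripts of this deck.

```python
import os

# Role of each Lucene 3.x file extension seen in the listing.
EXTENSIONS = {
    ".fdx": "stored fields index",
    ".fdt": "stored fields data",
    ".tis": "term dictionary",
    ".tii": "index into the term dictionary",
    ".frq": "term frequencies (documents containing each term)",
    ".prx": "term positions",
    ".nrm": "normalization factors",
    ".fnm": "field names",
}


def describe_index_files(index_dir):
    """Return (file name, size in bytes, description) for every file in index_dir."""
    rows = []
    for name in sorted(os.listdir(index_dir)):
        path = os.path.join(index_dir, name)
        if name.startswith("segments"):
            description = "segment metadata"
        else:
            description = EXTENSIONS.get(os.path.splitext(name)[1], "unknown")
        rows.append((name, os.path.getsize(path), description))
    return rows


# Only prints something if the index directory exists.
if os.path.isdir("test_index"):
    for name, size, description in describe_index_files("test_index"):
        print("{0:<14} {1:>6}  {2}".format(name, size, description))
```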

(11)

Search with IndexSearcher

Edit search_index.py:

import sys
import lucene
from lucene import *

lucene.initVM()

# This part must be the same as in create_index.py
dir_name = "test_index"
index_dir = SimpleFSDirectory(File(dir_name))
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

searcher = IndexSearcher(index_dir)

print "Insert a query:"
input_query = sys.stdin.readline().strip()

# Parse the query with the same analyzer, searching the "text" field
query = QueryParser(Version.LUCENE_35, "text", analyzer).parse(input_query)

MAX = 1000  # maximum number of hits to return
hits = searcher.search(query, MAX)

print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print u"score:{0} doc_id:{1} text:{2}".format(hit.score, hit.doc, doc.get("text"))
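Why must the analyzer be the same at indexing time and at search time? A query token can only match a token that is stored in exactly the same form. The toy example below (pure Python, no Lucene; both analyzer functions and `matches` are hypothetical helpers) shows a query failing when the two sides are analyzed differently.

```python
def whitespace_tokens(text):
    """Split on whitespace, keep case."""
    return text.split()


def lowercase_tokens(text):
    """Lowercase, then split on whitespace."""
    return text.lower().split()


def matches(doc_text, query_text, index_analyzer, query_analyzer):
    """True if any query token equals a token stored for the document."""
    indexed = set(index_analyzer(doc_text))
    return any(token in indexed for token in query_analyzer(query_text))


doc = "Obama u.s.a. NATO"

# Same analyzer on both sides: the query token is stored as "obama".
print(matches(doc, "Obama", lowercase_tokens, lowercase_tokens))   # True

# Index lowercased, query left as typed: "Obama" is not in the index.
print(matches(doc, "Obama", lowercase_tokens, whitespace_tokens))  # False
```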

(12)

Search with WhitespaceAnalyzer

Index three documents:

Obama u.s.a. NATO
obama usa N.A.T.O.
the cat meows

Run python create_index.py and enter one document per line; terminate the input with a double CTRL+D.

Search for:

cat
u.s.a.
OBAMA
N.A.T.O.

Start python search_index.py once for each query.

Then replace WhitespaceAnalyzer with ClassicAnalyzer in both scripts, recreate the index, and try the same queries.

(13)

Code

Get full code at:

http://bit.ly/212HUNl
