Information Retrieval - University of Pisa
One index to store them all
Marco Cornolti
Lucene: features
An open-source, customizable search engine library that does:
1. Indexing of a document collection
2. Search in a collection
...and many other things!
It has bindings for almost any language (e.g. PHP, Python).
Lucene: indexing
document 1, document 2, document 3, ..., document 1000
(text of each document)
        |
        v
internal representation: list of tokens
doc1: good, bad, ugly, spaghetti, western, movie
doc2: matrix, neo, morpheus, trinity
doc3: blues, brother, dan, aykroyd, john, belushi
...
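The document-to-token-list step can be sketched in a few lines of plain Python (an illustration of the idea, not Lucene's analyzer code):

```python
import re

def simple_tokenize(text):
    # Split on non-letter characters and lowercase, roughly what a
    # SimpleAnalyzer-style tokenizer does.
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

docs = {
    1: "Good, bad, ugly: spaghetti western movie",
    2: "Matrix: Neo, Morpheus, Trinity",
}
# Internal representation: doc id -> list of tokens.
index = {doc_id: simple_tokenize(text) for doc_id, text in docs.items()}
print(index[1])  # -> ['good', 'bad', 'ugly', 'spaghetti', 'western', 'movie']
```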
Lucene: search
user query: U.S.A. Alien invasion
        |
        v
internal representation of the query: list of tokens
usa, alien, invasion
        |
        v
search algorithm
        |
        v
document ordering according to relevance: document 3, document 1, document 2
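The pipeline above can be sketched in plain Python: build an inverted index from token lists, then order documents by how many query tokens they contain. This is a toy illustration of the idea, not Lucene's actual data structures or scoring.

```python
import re

def tokenize(text):
    # Split on non-letters and lowercase (SimpleAnalyzer-style).
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

docs = {
    1: "good bad ugly spaghetti western movie",
    2: "matrix neo morpheus trinity",
    3: "blues brothers dan aykroyd john belushi",
}

# Inverted index: token -> set of doc ids containing that token.
inverted = {}
for doc_id, text in docs.items():
    for token in tokenize(text):
        inverted.setdefault(token, set()).add(doc_id)

def search(query):
    # Score each document by the number of query tokens it contains,
    # then order by decreasing score (a crude stand-in for relevance).
    scores = {}
    for token in tokenize(query):
        for doc_id in inverted.get(token, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("western movie"))  # -> [1]
```

Real ranking functions weigh token frequency and rarity; the point here is only the index/search split.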
Lucene: indexing (2)
Internal representation of documents depends on the analyzer we use to parse them.
Examples:
● KeywordAnalyzer: whole input in one single token (no tokenization)
● WhitespaceAnalyzer: split on spaces
● SimpleAnalyzer: split on non-letters + lowercase
● ClassicAnalyzer: splits according to a set of rules (basically, non-letters) + lowercase + stopword removal + normalization
Analyzer examples
Input: Why did Obama, president of U.S.A., spoke at NATO summit?

WhitespaceAnalyzer:
Why | did | Obama, | president | of | U.S.A., | spoke | at | NATO | summit?

SimpleAnalyzer:
why | did | obama | president | of | u | s | a | spoke | at | nato | summit

ClassicAnalyzer:
why | obama | president | usa | spoke | nato | summit
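The three outputs can be approximated in plain Python. These regex sketches only mimic the analyzers' behavior on this example (the stopword list here is a tiny illustrative one, not Lucene's):

```python
import re

sentence = "Why did Obama, president of U.S.A., spoke at NATO summit?"

def whitespace_like(text):
    # WhitespaceAnalyzer: split on whitespace only; keep case and punctuation.
    return text.split()

def simple_like(text):
    # SimpleAnalyzer: split on non-letters and lowercase.
    return [t.lower() for t in re.split(r'[^a-zA-Z]+', text) if t]

STOPWORDS = {'did', 'of', 'at', 'the'}  # tiny illustrative stopword list

def classic_like(text):
    # Rough approximation of ClassicAnalyzer: keep dotted acronyms such as
    # U.S.A. as one token, normalize to "usa", lowercase, drop stopwords.
    tokens = re.findall(r"[A-Za-z]+(?:\.[A-Za-z]+)*", text)
    return [t.replace('.', '').lower()
            for t in tokens if t.lower() not in STOPWORDS]
```

Running the three functions on `sentence` reproduces the token lists shown above.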
Aim: Indexing
Build a toy search engine
● Read from input (one document per line)
● Index with Lucene
● Search with Lucene
Environment
From terminal:
sudo apt-get install python-lucene
Use of IndexWriter
edit create_index.py:
import lucene
import sys
from lucene import *

lucene.initVM()  # initialize the JVM

dir_name = "test_index"
# directory on the file system where the index is stored
index_dir = SimpleFSDirectory(File(dir_name))
# create the analyzer
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)
# create a new index using that analyzer (True deletes any pre-existing data)
writer = IndexWriter(index_dir, analyzer, True,
                     IndexWriter.MaxFieldLength.UNLIMITED)
for l in sys.stdin:
    doc = Document()
    # store the whole text (in its original form) in the index
    text_field = Field("text", l.strip(), Field.Store.YES, Field.Index.ANALYZED)
    doc.add(text_field)
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())
writer.optimize()
writer.close()
What happens in the file system?
$ ls -ltr test_index/
total 80
-rw-rw-r-- 1 marco marco  44 mar 11 23:43 _0.fdx
-rw-rw-r-- 1 marco marco  67 mar 11 23:43 _0.fdt
-rw-rw-r-- 1 marco marco  53 mar 11 23:43 _0.tis
-rw-rw-r-- 1 marco marco  35 mar 11 23:43 _0.tii
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.prx
-rw-rw-r-- 1 marco marco   9 mar 11 23:43 _0.nrm
-rw-rw-r-- 1 marco marco   8 mar 11 23:43 _0.frq
-rw-rw-r-- 1 marco marco  12 mar 11 23:43 _0.fnm
-rw-rw-r-- 1 marco marco  20 mar 11 23:43 segments.gen
-rw-rw-r-- 1 marco marco 262 mar 11 23:43 segments_1
Search with IndexReader
edit search_index.py:
import lucene
import sys
from lucene import *

lucene.initVM()

dir_name = "test_index"
# this part must be the same as in create_index.py
index_dir = SimpleFSDirectory(File(dir_name))
analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

searcher = IndexSearcher(index_dir)
print "Insert a query:"
input_query = sys.stdin.readline().strip()
query = QueryParser(Version.LUCENE_35, "text", analyzer).parse(input_query)
MAX = 1000
hits = searcher.search(query, MAX)
print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print u"score:{0} doc_id:{1} text:{2}".format(hit.score, hit.doc, doc.get("text"))
Search with WhitespaceAnalyzer
Index three documents:
Obama u.s.a. NATO
obama usa N.A.T.O.
the cat meows
Search for:
cat
u.s.a.
OBAMA
N.A.T.O.
Run:
python create_index.py
(one document per line, terminate with double CTRL+D)
Then start:
python search_index.py
Replace WhitespaceAnalyzer with ClassicAnalyzer, recreate the index, and try the same queries.
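The outcome of the exercise can be predicted with the toy tokenizers from before: under whitespace splitting a query token must match the stored token exactly, while a ClassicAnalyzer-style normalization makes case and dotted acronyms irrelevant. This sketch only approximates the two analyzers; it is not Lucene's matching code.

```python
import re

def whitespace_tokens(text):
    # Exact whitespace tokens, case and punctuation preserved.
    return text.split()

def classic_like_tokens(text):
    # Approximate ClassicAnalyzer: collapse dotted acronyms, lowercase.
    tokens = re.findall(r"[A-Za-z]+(?:\.[A-Za-z]+)*", text)
    return [t.replace('.', '').lower() for t in tokens]

doc = "Obama u.s.a. NATO"

def matches(query, tokenizer):
    # A query matches when every query token appears in the document.
    doc_tokens = set(tokenizer(doc))
    return all(t in doc_tokens for t in tokenizer(query))

# Whitespace: only the literal token "u.s.a." matches this document.
print(matches("u.s.a.", whitespace_tokens))      # True
print(matches("OBAMA", whitespace_tokens))       # False
# Classic-like: "OBAMA" and "N.A.T.O." both normalize to stored tokens.
print(matches("OBAMA", classic_like_tokens))     # True
print(matches("N.A.T.O.", classic_like_tokens))  # True
```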