Information Retrieval - University of Pisa
...and in the fields, represent them
Marco Cornolti
Previously, on IR
Lucene: search engine on a document collection. Two operations:
1. Index 2. Search
Analyzer: transforms text (and queries) into tokens. For a document to appear as a result, it must
contain at least one token identical to a query token.
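The matching rule above can be tried out without Lucene. The following is only a toy sketch: `simple_analyze` is a hypothetical stand-in for an analyzer (it lowercases the text and splits on non-alphanumeric characters), not a Lucene class.

```python
import re

def simple_analyze(text):
    """Hypothetical toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def matches(document, query):
    """A document matches if it shares at least one token with the query."""
    return bool(set(simple_analyze(document)) & set(simple_analyze(query)))

print(matches("Obama spoke at the summit", "obama nato"))
print(matches("The GDP figures were released", "obama nato"))
```

The same analyzer is applied to both the document and the query, which is what makes the token comparison meaningful.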
Search: workflow
1. User query: N.A.T.O Obama
2. Internal representation of the query, a list of tokens: nato, obama
3. Indexed documents (document 1, document 2, document 3), each with its internal representation, a list of tokens: [tokens 1], [tokens 2], [tokens 3]
4. Search algorithm: for each doc, is there any token identical to one in the query?
5. Result: documents ordered by relevance, each with a document-query match score: document 3 (1.53), document 1 (1.04), document 2 (0.62)
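The workflow above can be sketched in a few lines, working directly on token lists. The scoring function is an assumption made for illustration (the fraction of query tokens found in the document); it is not Lucene's scoring formula, and the sample token lists are invented.

```python
def score(doc_tokens, query_tokens):
    """Toy score: fraction of query tokens found in the document (not Lucene's formula)."""
    if not query_tokens:
        return 0.0
    doc_set = set(doc_tokens)
    found = sum(1 for tok in query_tokens if tok in doc_set)
    return found / float(len(query_tokens))

def search(index, query_tokens):
    """Return (doc_id, score) pairs for the matching documents, best score first."""
    scored = [(doc_id, score(tokens, query_tokens)) for doc_id, tokens in index.items()]
    matching = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)

# Invented token lists standing in for [tokens 1], [tokens 2], [tokens 3].
index = {
    "document 1": ["obama", "visits", "berlin"],
    "document 2": ["gdp", "figures", "released"],
    "document 3": ["nato", "summit", "obama", "speech"],
}
results = search(index, ["nato", "obama"])
print(results)
```

Documents with no token in common with the query get a score of zero and are left out of the result list.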
Quiz
For each field/query pair: match?

field: the U.S.A. President Trump, yesterday, spoke at UN.
query: Trump usa
WhitespaceAnalyzer: NO    ClassicAnalyzer: YES    KeywordAnalyzer: NO

field: the U.S.A. President Trump, yesterday, spoke at UN.
query: the GDP shrinked
WhitespaceAnalyzer: YES    ClassicAnalyzer: NO    KeywordAnalyzer: NO

field: PC3421D
query: PC3421D
WhitespaceAnalyzer: YES    ClassicAnalyzer: YES    KeywordAnalyzer: YES
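The quiz answers can be checked with rough pure-Python approximations of the three analyzers. These are assumptions made for illustration, not the Lucene implementations: `classic_like_tokens` only imitates the behaviours relevant here (lowercasing, dropping a small assumed stop-word list, removing the dots of acronyms, stripping punctuation).

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "at", "of", "in", "to"}  # assumed subset
ACRONYM = re.compile(r"^(?:[a-z]\.)+$")  # e.g. "u.s.a."

def whitespace_tokens(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def keyword_tokens(text):
    """The whole text is a single token."""
    return [text]

def classic_like_tokens(text):
    """Rough imitation: lowercase, join acronyms, strip punctuation, drop stop words."""
    tokens = []
    for chunk in text.lower().split():
        if ACRONYM.match(chunk):
            words = [chunk.replace(".", "")]
        else:
            words = re.findall(r"[a-z0-9]+", chunk)
        tokens.extend(w for w in words if w not in STOP_WORDS)
    return tokens

def matches(analyzer, field, query):
    """True if field and query share at least one token under the given analyzer."""
    return bool(set(analyzer(field)) & set(analyzer(query)))

SENTENCE = "the U.S.A. President Trump, yesterday, spoke at UN."
for analyzer in (whitespace_tokens, classic_like_tokens, keyword_tokens):
    print(analyzer.__name__, matches(analyzer, SENTENCE, "Trump usa"))
```

With whitespace splitting, the field token is "Trump," (with the comma), so it differs from the query token "Trump"; this is why only the classic-like analyzer finds a match for the first query.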
Fields
● An indexed document is a set of fields
● We can run field-specific searches
● Each field can use a different analyzer
● Examples:
Newspaper article:
● Author
● Title
● Text
● Category
● Date
Analyzers: WhitespaceAnalyzer, ClassicAnalyzer, KeywordAnalyzer
Item in a shop:
● Name
● Description
● Brand
● Product code
Analyzers: ClassicAnalyzer, KeywordAnalyzer
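The idea of a document as a set of fields, each with its own analyzer, can be sketched as a dictionary. Everything here is hypothetical: the field-to-analyzer assignment, the sample item, and the two toy analyzers are assumptions, not Lucene classes.

```python
import re

def text_tokens(value):
    """Toy text analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", value.lower())

def keyword_tokens(value):
    """Toy keyword analyzer: the whole value is one token."""
    return [value]

# Assumed assignment: free text is tokenized, the product code is kept whole.
FIELD_ANALYZERS = {
    "name": text_tokens,
    "description": text_tokens,
    "brand": text_tokens,
    "product_code": keyword_tokens,
}

def index_document(doc):
    """Analyze every field with its own analyzer."""
    return {field: set(FIELD_ANALYZERS[field](value)) for field, value in doc.items()}

def field_match(indexed_doc, field, query):
    """Field-specific search: the query is analyzed like the field it targets."""
    query_tokens = set(FIELD_ANALYZERS[field](query))
    return bool(indexed_doc.get(field, set()) & query_tokens)

item = index_document({
    "name": "Wireless Mouse",
    "description": "Ergonomic mouse with USB receiver",
    "brand": "Acme",
    "product_code": "PC3421D",
})
print(field_match(item, "name", "mouse"))
print(field_match(item, "product_code", "pc3421d"))
```

Keeping the product code as a single unmodified token means it only matches an exact query, which is usually what is wanted for identifiers.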
Aim: Fields
● Index of IMDB movies. For each movie, we extract (with XPath):
○ Title
○ Release year
○ Director
○ Synopsis
● Field-based queries
Parsing data from html
Have a look at the IMDB HTML code and come up with XPath expressions to extract the information we need.
edit html_to_data in processhtml.py:
● Extract movie title
● Extract synopsis
Make a script that prints all the extracted information for those pages.
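As a warm-up for the XPath exercise, here is a minimal sketch using only the standard library. The markup in `SAMPLE_PAGE`, the XPath expressions, and the function name `extract_movie_fields` are all hypothetical; they do not reflect the real IMDB pages or the course's html_to_data function.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page; real pages have a different structure.
SAMPLE_PAGE = """<html><body>
<h1 class="title">Example Movie</h1>
<span class="year">1999</span>
<div class="director">Jane Doe</div>
<p class="synopsis">A short plot summary.</p>
</body></html>"""

# One (hypothetical) XPath expression per piece of information.
XPATHS = {
    "title": ".//h1[@class='title']",
    "year": ".//span[@class='year']",
    "director": ".//div[@class='director']",
    "synopsis": ".//p[@class='synopsis']",
}

def extract_movie_fields(page):
    """Return a dict field -> text, with None for elements that are missing."""
    root = ET.fromstring(page)
    data = {}
    for name, xpath in XPATHS.items():
        node = root.find(xpath)
        data[name] = node.text.strip() if node is not None and node.text else None
    return data

print(extract_movie_fields(SAMPLE_PAGE))
```

ElementTree supports only a subset of XPath and requires well-formed XML; real web pages are rarely well-formed, so a tolerant HTML parser would be needed in practice.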
Create index: Analyzer
edit create_index.py:
import lucene
import sys
from lucene import *
from processhtml import *

lucene.initVM()
dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))

# create a distinct analyzer for each field
analyzer = PerFieldAnalyzerWrapper(ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("synopsis", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("year", KeywordAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("title", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("director", WhitespaceAnalyzer(Version.LUCENE_35))

writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength.UNLIMITED)
[...Cycle over all movies...]
Create index: document
edit index.py:
[...Cycle over all files...]
    [... print, get title, date, synopsis, director …]
    doc = Document()
    doc.add(Field("title", title, Field.Store.YES, Field.Index.ANALYZED))
    if date:
        # field name "year" matches the KeywordAnalyzer registered in create_index.py
        doc.add(Field("year", date, Field.Store.YES, Field.Index.ANALYZED))
    if synopsis:
        doc.add(Field("synopsis", synopsis, Field.Store.YES, Field.Index.ANALYZED))
    if director:
        doc.add(Field("director", director, Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())

# after main loop
writer.optimize()
writer.close()
Search by field (1)
edit search_index.py:
import lucene
import sys
from lucene import *

lucene.initVM()
# index directory and analyzers: same as create_index.py
dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))
searcher = IndexSearcher(index_dir)
classic_analyzer = ClassicAnalyzer(Version.LUCENE_35)
whitespace_analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

title_terms = raw_input("Insert a title:")
synopsis_terms = raw_input("Insert synopsis terms:")

# every non-empty input becomes a mandatory clause of the query
query = BooleanQuery()
if synopsis_terms:
    synopsis_query = QueryParser(Version.LUCENE_35, "synopsis", classic_analyzer).parse(synopsis_terms)
    query.add(synopsis_query, BooleanClause.Occur.MUST)
if title_terms:
    title_query = QueryParser(Version.LUCENE_35, "title", classic_analyzer).parse(title_terms)
    query.add(title_query, BooleanClause.Occur.MUST)
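The effect of combining per-field clauses with Occur.MUST can be sketched in plain Python. This is a simplified model under stated assumptions: a clause is satisfied when the field shares at least one token with the query terms, a query with no clauses matches nothing, and `tokens`, `must_match_all`, and the two sample movies are invented for illustration.

```python
import re

def tokens(text):
    """Toy analyzer: set of lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def must_match_all(doc, clauses):
    """doc: field -> text; clauses: field -> query terms.
    Empty inputs are skipped; every remaining clause must be satisfied."""
    active = {field: terms for field, terms in clauses.items() if terms}
    if not active:
        return False  # assumption: a query without clauses matches nothing
    return all(tokens(doc.get(field, "")) & tokens(terms)
               for field, terms in active.items())

# Invented movies.
movies = [
    {"title": "Robots of Mars", "synopsis": "A crew of robots explores a red planet"},
    {"title": "Mars Picnic", "synopsis": "A family comedy about a lost lunch basket"},
]

for movie in movies:
    print(movie["title"], must_match_all(movie, {"title": "mars", "synopsis": "robots"}))
```

When the user leaves one input empty, only the other clause is applied, so both sample movies match a title-only search for "mars".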
Search by field (2)
add to search_index.py:
[... follows …]
MAX = 1000
hits = searcher.search(query, MAX)
print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print "score: ", hit.score
    print "doc_id: ", hit.doc
    print "title: ", doc.get("title")
    print "director: ", doc.get("director")
    print