...and in the fields, represent them

(1)

Information Retrieval - University of Pisa

Marco Cornolti

(2)

Previously, on IR

Lucene: a search engine over a document collection. Two operations:

1. Index
2. Search

Analyzer: transforms text (and queries) into tokens. For a document to appear as a result, it must contain at least one token identical to a query token.

(3)

Search: workflow

[Diagram: search workflow. The user query ("N.A.T.O Obama") is turned into its internal representation, a list of tokens (nato, obama). Each indexed document (document 1, document 2, document 3) also has an internal representation as a list of tokens ([tokens 1], [tokens 2], [tokens 3]). The search algorithm asks, for each doc: is there any token identical to one in the query? The output is the documents ordered by relevance (document 3, document 1, document 2), each with a document-query match score (1.53, 1.04, 0.62).]
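The workflow can be sketched in a few lines of plain Python, without Lucene. This is only an illustration: the documents and their tokens are invented, and the score used here (the number of distinct query tokens found in the document) is an assumption made for the sketch, not Lucene's actual scoring formula.

```python
# Toy version of the search workflow: no Lucene involved.
# The documents and their tokens are invented for illustration.
indexed_docs = {
    "document 1": ["obama", "visits", "rome"],
    "document 2": ["nato", "summit", "opens"],
    "document 3": ["obama", "speaks", "at", "nato", "meeting"],
    "document 4": ["weather", "forecast"],
}


def search(query_tokens, docs):
    """Return (doc_name, score) pairs, best match first.

    A document is a result if it shares at least one token with the query.
    The score is the number of distinct query tokens it contains
    (an assumption for this sketch, not Lucene's formula).
    """
    query = set(query_tokens)
    results = []
    for name, tokens in docs.items():
        shared = query & set(tokens)
        if shared:
            results.append((name, len(shared)))
    # Highest score first; ties are broken by name to keep the order stable.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


ranked = search(["nato", "obama"], indexed_docs)
for name, score in ranked:
    print("{0}: {1}".format(name, score))
```

Document 4 shares no token with the query, so it never appears in the results, whatever its content.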

(4)

Quiz

field: the U.S.A. President Trump, yesterday, spoke at UN.
query: Trump usa
match? WhitespaceAnalyzer: NO, ClassicAnalyzer: YES, KeywordAnalyzer: NO

field: the U.S.A. President Trump, yesterday, spoke at UN.
query: the GDP shrinked
match? WhitespaceAnalyzer: YES, ClassicAnalyzer: NO, KeywordAnalyzer: NO

field: PC3421D
query: PC3421D
match? WhitespaceAnalyzer: YES, ClassicAnalyzer: YES, KeywordAnalyzer: YES
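The quiz answers can be checked with simplified stand-ins for the three analyzers. The sketch below is plain Python, not Lucene: `whitespace_tokens` and `keyword_tokens` follow the obvious behaviour (split on whitespace, keep the whole text as one token), while `classic_like_tokens` is only a rough approximation of ClassicAnalyzer (lowercasing, removing the dots of acronyms, dropping a few stop words). All function names and the short stop-word list are assumptions made for this example.

```python
import re

# A small subset of English stop words (Lucene's default list is longer).
STOP_WORDS = {"the", "a", "an", "at", "and", "of", "in", "to"}


def whitespace_tokens(text):
    # Split on whitespace only; case and punctuation are kept.
    return text.split()


def keyword_tokens(text):
    # The whole text becomes one single token.
    return [text]


def classic_like_tokens(text):
    # Rough approximation of ClassicAnalyzer: lowercase, drop the dots of
    # acronyms such as U.S.A., split on non-alphanumerics, remove stop words.
    text = text.lower()
    text = re.sub(r"\b(?:[a-z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def matches(analyze, field_text, query_text):
    # A match needs at least one token shared by field and query.
    return bool(set(analyze(field_text)) & set(analyze(query_text)))


ANALYZERS = [
    ("WhitespaceAnalyzer", whitespace_tokens),
    ("ClassicAnalyzer (approx.)", classic_like_tokens),
    ("KeywordAnalyzer", keyword_tokens),
]

SENTENCE = "the U.S.A. President Trump, yesterday, spoke at UN."
QUIZ = [
    (SENTENCE, "Trump usa"),
    (SENTENCE, "the GDP shrinked"),
    ("PC3421D", "PC3421D"),
]


def answers(field_text, query_text):
    return ["YES" if matches(analyze, field_text, query_text) else "NO"
            for _, analyze in ANALYZERS]


for field_text, query_text in QUIZ:
    print("{0!r} -> {1}".format(query_text,
                                " ".join(answers(field_text, query_text))))
```

The second case shows why stop words matter: with whitespace splitting the only shared token is "the", which the ClassicAnalyzer-like function discards.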

(5)

Fields

● An indexed document is a set of fields

● We can run field-specific searches

● Each field can use a different analyzer

● Examples:

Newspaper article:

● Author

● Title

● Text

● Category

● Date

Item in a shop:

● Name

● Description

● Brand

● Product code

[Slide labels: ClassicAnalyzer, KeywordAnalyzer]

(6)

Aim: Fields

● Index of IMDB movies. For each movie, we extract (with XPath):

○ Title

○ Release year

○ Director

○ Synopsis

● Field-based queries

(7)

Parsing data from HTML

Have a look at the IMDB HTML code and come up with XPaths to extract the information we need.

Edit html_to_data in processhtml.py:

● Extract movie title

● Extract synopsis

Make a script that prints all the information for those pages.
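As a warm-up for the exercise, here is a minimal sketch of XPath-style extraction using only the Python standard library. Everything in it is an assumption made for illustration: the page markup is invented (real IMDB pages use different tags and classes and are usually not well-formed XML, so a real HTML parser would be needed), and `html_to_data_sketch` is a hypothetical name, not the `html_to_data` function of the course material.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page. Real IMDB markup is different.
PAGE = """
<html>
  <body>
    <h1 class="title">The Example Movie</h1>
    <span class="year">1999</span>
    <div class="director"><a href="#">Jane Doe</a></div>
    <div class="synopsis"><p>A short made-up plot summary.</p></div>
  </body>
</html>
"""


def first_text(root, path):
    """Return the stripped text of the first node matching path, or None."""
    node = root.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def html_to_data_sketch(page):
    """Extract the four fields with (a subset of) XPath expressions."""
    root = ET.fromstring(page)
    return {
        "title": first_text(root, ".//h1[@class='title']"),
        "date": first_text(root, ".//span[@class='year']"),
        "director": first_text(root, ".//div[@class='director']/a"),
        "synopsis": first_text(root, ".//div[@class='synopsis']/p"),
    }


data = html_to_data_sketch(PAGE)
for key in sorted(data):
    print("{0}: {1}".format(key, data[key]))
```

Returning `None` for a missing node lets the indexing code decide whether to skip that field for a given movie.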

(8)

Create index: Analyzer

edit create_index.py:

import lucene
import sys
from lucene import *
from processhtml import *

lucene.initVM()

dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))

# Create a distinct analyzer for each field; ClassicAnalyzer is the default.
# Field names must match those used when adding fields to the documents.
analyzer = PerFieldAnalyzerWrapper(ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("synopsis", ClassicAnalyzer(Version.LUCENE_35))
# KeywordAnalyzer takes no Version argument.
analyzer.addAnalyzer("date", KeywordAnalyzer())
analyzer.addAnalyzer("title", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("director", WhitespaceAnalyzer(Version.LUCENE_35))

# True: create a new index, overwriting an existing one.
writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength.UNLIMITED)

[...Cycle over all movies...]
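The idea behind a per-field wrapper is a simple dispatch: look up the analyzer registered for the field, and fall back to a default otherwise. The sketch below mimics that behaviour in plain Python so it can be tried without Lucene. The class `PerFieldAnalyzerSketch`, its methods and the two toy analyzer functions are hypothetical names introduced for this example; they are not part of the Lucene API.

```python
class PerFieldAnalyzerSketch(object):
    """Toy stand-in for per-field analyzer dispatch (not the Lucene class)."""

    def __init__(self, default_analyzer):
        self.default_analyzer = default_analyzer
        self.by_field = {}

    def add_analyzer(self, field_name, analyzer):
        # Register a specific analyzer function for one field.
        self.by_field[field_name] = analyzer

    def tokens(self, field_name, text):
        # Use the field's own analyzer if registered, else the default.
        analyzer = self.by_field.get(field_name, self.default_analyzer)
        return analyzer(text)


def lowercase_words(text):
    # Default toy analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def keep_case_words(text):
    # Toy analyzer for names: split on whitespace, keep the case.
    return text.split()


wrapper = PerFieldAnalyzerSketch(lowercase_words)
wrapper.add_analyzer("director", keep_case_words)

director_tokens = wrapper.tokens("director", "Jane Doe")
title_tokens = wrapper.tokens("title", "The Example Movie")
print(director_tokens)
print(title_tokens)
```

A field with no registered analyzer ("title" here) silently gets the default, which is why a misspelled field name is easy to miss.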

(9)

Create index: document

edit index.py:

[...Cycle over all files...]
    [... print, get title, date, synopsis, director …]
    doc = Document()
    doc.add(Field("title", title, Field.Store.YES, Field.Index.ANALYZED))
    # Optional fields are added only when a value was extracted.
    if date:
        doc.add(Field("date", date, Field.Store.YES, Field.Index.ANALYZED))
    if synopsis:
        doc.add(Field("synopsis", synopsis, Field.Store.YES, Field.Index.ANALYZED))
    if director:
        doc.add(Field("director", director, Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())

# After the main loop:
writer.optimize()
writer.close()

(10)

Search by field (1)

edit search_index.py:

import lucene
import sys
from lucene import *

lucene.initVM()

dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))
searcher = IndexSearcher(index_dir)

# Same analyzers as in create_index.py
classic_analyzer = ClassicAnalyzer(Version.LUCENE_35)
whitespace_analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

title_terms = raw_input("Insert a title:")
synopsis_terms = raw_input("Insert synopsis terms:")

# Each non-empty input becomes a clause that the document MUST match.
query = BooleanQuery()
if synopsis_terms:
    synopsis_query = QueryParser(Version.LUCENE_35, "synopsis", classic_analyzer).parse(synopsis_terms)
    query.add(synopsis_query, BooleanClause.Occur.MUST)
if title_terms:
    title_query = QueryParser(Version.LUCENE_35, "title", classic_analyzer).parse(title_terms)
    query.add(title_query, BooleanClause.Occur.MUST)
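Combining clauses with MUST means that a document is returned only if it satisfies every clause. The sketch below reproduces that logic in plain Python on an invented mini-index whose fields are already tokenized. The movie data and the helper names (`clause_matches`, `must_all`) are assumptions made for this example, and no scoring is attempted.

```python
# Invented mini-index: each movie maps field names to already-analyzed tokens.
movies = {
    "movie A": {"title": ["space", "journey"],
                "synopsis": ["crew", "lost", "space"]},
    "movie B": {"title": ["space", "pirates"],
                "synopsis": ["treasure", "hunt"]},
    "movie C": {"title": ["quiet", "town"],
                "synopsis": ["crew", "films", "town"]},
}


def clause_matches(doc_fields, field, query_tokens):
    # A clause matches if the field shares at least one token with the query.
    return bool(set(doc_fields.get(field, [])) & set(query_tokens))


def must_all(docs, clauses):
    """Names of the docs that satisfy every (field, tokens) clause."""
    if not clauses:
        # An empty boolean query matches nothing.
        return []
    return sorted(
        name for name, fields in docs.items()
        if all(clause_matches(fields, field, tokens)
               for field, tokens in clauses)
    )


title_terms = ["space"]
synopsis_terms = ["crew"]

# Build the clause list the same way the script builds its query.
clauses = []
if synopsis_terms:
    clauses.append(("synopsis", synopsis_terms))
if title_terms:
    clauses.append(("title", title_terms))

result = must_all(movies, clauses)
print(result)
```

Leaving one input empty drops its clause, so the search is then restricted by the other field alone.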

(11)

Search by field (2)

add to search_index.py:

[... follows …]

MAX = 1000
hits = searcher.search(query, MAX)
print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print "score: ", hit.score
    print "doc_id: ", hit.doc
    print "title: ", doc.get("title")
    print "director: ", doc.get("director")
    print
