
...and in the fields, represent them


Academic year: 2021



Information Retrieval - University of Pisa

...and in the fields, represent them

Marco Cornolti


Previously, on IR

Lucene: search engine on a document collection. Two operations:

1. Index
2. Search

Analyzer: transforms text (and queries) into tokens. For a document to appear as a result, it must contain at least one token identical to a query token.


Search: workflow

user query: N.A.T.O Obama

representation of query (list of tokens): nato, obama

indexed documents: document 1 [tokens 1], document 2 [tokens 2], document 3 [tokens 3]

representation of doc: list of tokens

search algorithm: for each doc, is there any token identical to one in the query?

output of the search algorithm: document-query match score

documents ordered by relevance: document 3, document 1, document 2
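The workflow above can be sketched in a few lines of plain Python. This is only an illustration, not Lucene: the `analyze` function, the scoring rule (number of shared tokens) and the three sample documents are assumptions made for the example.

```python
def analyze(text):
    """Toy analyzer (assumption): lowercase, drop dots, split on whitespace."""
    return text.lower().replace(".", "").split()


def search(query, documents):
    """Return (doc_id, score) pairs for matching documents, best first."""
    query_tokens = set(analyze(query))
    results = []
    for doc_id, text in documents.items():
        # Toy score (assumption): number of distinct tokens shared with the query.
        score = len(query_tokens & set(analyze(text)))
        if score > 0:  # at least one identical token is required
            results.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical document collection.
documents = {
    1: "Obama met the press",
    2: "The GDP fell last year",
    3: "Obama addressed the N.A.T.O. summit",
}

ranking = search("N.A.T.O Obama", documents)
print(ranking)
```

Document 2 shares no token with the query, so it does not appear in the ranking at all. Lucene's real score is more elaborate than a shared-token count.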



field: the U.S.A. President Trump, yesterday, spoke at UN.

Does the query match the field?

query                                                  WhitespaceAnalyzer   ClassicAnalyzer   KeywordAnalyzer
Trump usa
the GDP shrinked
the U.S.A. President Trump, yesterday, spoke at UN.    YES                  YES               YES
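To reason about which analyzer lets a query match a field, it can help to imitate the three analyzers in plain Python. The functions below are rough approximations written for this example, not Lucene's implementations: in particular `classic_like_tokens` and its stop-word list are assumptions that only mimic lowercasing, acronym handling, punctuation removal and stop-word filtering.

```python
import re

STOP_WORDS = {"the", "at", "a", "an", "of", "in"}  # small illustrative subset


def whitespace_tokens(text):
    # Split on whitespace only: case and punctuation are preserved.
    return text.split()


def keyword_tokens(text):
    # The whole field value becomes one single token.
    return [text]


def classic_like_tokens(text):
    # Approximation (assumption): lowercase, drop dots, split on anything
    # that is not a letter or digit, then remove stop words.
    cleaned = text.lower().replace(".", "")
    tokens = re.split(r"[^a-z0-9]+", cleaned)
    return [t for t in tokens if t and t not in STOP_WORDS]


def matches(tokenize, field, query):
    # A match needs at least one token identical in field and query.
    return bool(set(tokenize(field)) & set(tokenize(query)))


field = "the U.S.A. President Trump, yesterday, spoke at UN."
query = "Trump usa"

for name, tokenize in [
    ("whitespace", whitespace_tokens),
    ("classic-like", classic_like_tokens),
    ("keyword", keyword_tokens),
]:
    print(name, tokenize(field), matches(tokenize, field, query))
```

Printing the token lists makes it visible why a token such as `Trump,` (with the trailing comma) is not identical to `Trump`.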




● An indexed document is a set of fields

● We can run field-specific searches

● Each field can use a different analyzer

● Examples:

Newspaper article:

● Author

● Title

● Text

● Category

● Date

Analyzers used for these fields: WhitespaceAnalyzer, ClassicAnalyzer, KeywordAnalyzer

Item in a shop:

● Name

● Description

● Brand

● Product code

Analyzers used for these fields: ClassicAnalyzer, KeywordAnalyzer
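The idea of a document as a set of fields, each with its own analyzer, can be sketched without Lucene. Everything here is hypothetical: the shop item, the field names and the two toy analyzers are assumptions chosen to show why a product code is best kept as a single token while descriptive text is split into words.

```python
def keyword(text):
    # Keep the whole value as one token (useful for codes and identifiers).
    return [text]


def lowercase_words(text):
    # Toy word analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


# One analyzer per field (hypothetical field names).
FIELD_ANALYZERS = {
    "name": lowercase_words,
    "description": lowercase_words,
    "product_code": keyword,
}


def index_document(fields):
    # Store, for each field, the set of tokens produced by that field's analyzer.
    return {name: set(FIELD_ANALYZERS[name](value)) for name, value in fields.items()}


def field_match(indexed_doc, field, query):
    # The query is analyzed with the same analyzer as the field it targets.
    query_tokens = set(FIELD_ANALYZERS[field](query))
    return bool(indexed_doc.get(field, set()) & query_tokens)


item = index_document({
    "name": "Espresso Machine",
    "description": "Compact machine for espresso lovers",
    "product_code": "EM-2000 X",
})

print(field_match(item, "name", "espresso"))
print(field_match(item, "product_code", "EM-2000"))
print(field_match(item, "product_code", "EM-2000 X"))
```

A partial product code does not match because the keyword-style field holds exactly one token, the full code.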


Aim: Fields

● Index of IMDB movies. For each movie, we extract (with XPath):

○ Title

○ Release year

○ Director

○ Synopsis

● Field-based queries


Parsing data from html

Have a look at the IMDB HTML code, and come up with XPaths to extract the information we need.

edit html_to_data in processhtml.py:

● Extract movie title

● Extract synopsis

Make a script that prints all information for those pages.
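As a starting point for the exercise, here is a minimal sketch of XPath-based extraction using only the standard library. The page snippet, its class names and the signature of `html_to_data` are all assumptions: real IMDB markup is different and usually not well-formed XML, so a lenient HTML parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page (not real IMDB markup).
PAGE = """
<html>
  <body>
    <h1 class="title">Example Movie</h1>
    <span class="year">2005</span>
    <div class="director">Jane Doe</div>
    <p class="synopsis">A short plot summary.</p>
  </body>
</html>
"""


def html_to_data(page):
    """Extract movie fields from a well-formed page (assumed signature)."""
    root = ET.fromstring(page)

    def first_text(xpath):
        # Return the stripped text of the first node matching the XPath, if any.
        node = root.find(xpath)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": first_text(".//h1[@class='title']"),
        "date": first_text(".//span[@class='year']"),
        "director": first_text(".//div[@class='director']"),
        "synopsis": first_text(".//p[@class='synopsis']"),
    }


data = html_to_data(PAGE)
for key in ("title", "date", "director", "synopsis"):
    print(key, ":", data[key])
```

Returning `None` for a missing node fits the later indexing step, which only adds a field when its value is present.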


Create index: Analyzer

edit create_index.py:

import lucene
import sys
from lucene import *
from processhtml import *

lucene.initVM()

dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))

# create a distinct analyzer for each field
# (field names must match those added to each Document, hence "date")
analyzer = PerFieldAnalyzerWrapper(ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("synopsis", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("date", KeywordAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("title", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("director", WhitespaceAnalyzer(Version.LUCENE_35))

writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength.UNLIMITED)

[...Cycle over all movies...]


Create index: document

edit index.py:

[...Cycle over all files...]
    [... print, get title, date, synopsis, director …]
    doc = Document()
    doc.add(Field("title", title, Field.Store.YES, Field.Index.ANALYZED))
    # optional fields are added only when a value was extracted
    if date:
        doc.add(Field("date", date, Field.Store.YES, Field.Index.ANALYZED))
    if synopsis:
        doc.add(Field("synopsis", synopsis, Field.Store.YES, Field.Index.ANALYZED))
    if director:
        doc.add(Field("director", director, Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
    print "Currently there are {0} documents in the index...".format(writer.numDocs())

# after main loop
writer.optimize()
writer.close()


Search by field (1)

edit search_index.py:

import lucene
import sys
from lucene import *

lucene.initVM()

dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))
searcher = IndexSearcher(index_dir)

classic_analyzer = ClassicAnalyzer(Version.LUCENE_35)
whitespace_analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

title_terms = raw_input("Insert a title:")
synopsis_terms = raw_input("Insert synopsis terms:")

# each non-empty input becomes a mandatory clause of the boolean query
query = BooleanQuery()
if synopsis_terms:
    synopsis_query = QueryParser(Version.LUCENE_35, "synopsis", classic_analyzer).parse(synopsis_terms)
    query.add(synopsis_query, BooleanClause.Occur.MUST)
if title_terms:
    title_query = QueryParser(Version.LUCENE_35, "title", classic_analyzer).parse(title_terms)
    query.add(title_query, BooleanClause.Occur.MUST)

same as create_index.py
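The effect of adding every clause with Occur.MUST can be pictured in plain Python: a movie is returned only if each field clause matches. This is a sketch of the semantics, not of Lucene's implementation. The sample movies are hypothetical, and each clause is assumed to match when any of its terms occurs in the field, which is how the query parser behaves with its default OR operator.

```python
def matches_field(doc, field, terms):
    # A field clause matches if the field shares at least one token with the terms.
    field_tokens = set(doc.get(field, "").lower().split())
    return bool(field_tokens & set(terms.lower().split()))


def must_all(doc, clauses):
    # Every (field, terms) clause has to match, like Occur.MUST on each clause.
    return all(matches_field(doc, field, terms) for field, terms in clauses)


# Hypothetical movies.
movies = [
    {"title": "Space Journey", "synopsis": "a crew travels to mars"},
    {"title": "Space Kitchen", "synopsis": "a chef opens a restaurant"},
    {"title": "Desert Road", "synopsis": "a trip to mars goes wrong"},
]

clauses = [("title", "space"), ("synopsis", "mars")]
selected = [m["title"] for m in movies if must_all(m, clauses)]
print(selected)
```

Leaving one input empty simply removes that clause, so the remaining clause alone decides which movies match.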


Search by field (2)

add to search_index.py:

[... follows …]

MAX = 1000
hits = searcher.search(query, MAX)

print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)

# print score, id and stored fields of each hit
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)
    print "score: ", hit.score
    print "doc_id: ", hit.doc
    print "title: ", doc.get("title")
    print "director: ", doc.get("director")
    print


Playing with Lucene

● Add other fields

● Add range queries for date (movies released in 2003-2010)

● Try to index a news site (BBC, CNN, ...)
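For the range-query exercise, one possible approach relies on the date being indexed as a single token: four-digit years sort the same way as strings and as numbers, so a range between the terms "2003" and "2010" selects the right movies. Lucene 3.x provides TermRangeQuery for ranges over terms. The sketch below does not call it and only illustrates the string comparison, with hypothetical movies.

```python
def in_term_range(term, lower, upper):
    # Terms are compared as strings, the way an index orders its terms.
    return lower <= term <= upper


# Hypothetical movies with their release year stored as a string term.
movies = {
    "Movie A": "1999",
    "Movie B": "2003",
    "Movie C": "2007",
    "Movie D": "2010",
    "Movie E": "2015",
}

selected = sorted(
    title for title, year in movies.items()
    if in_term_range(year, "2003", "2010")
)
print(selected)
```

The string comparison is only safe while every year has the same number of digits; values of mixed length would need padding or a numeric field.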

