
(1)

Information Retrieval - University of Pisa

...and in the fields, represent them

Marco Cornolti

(2)

Previously, on IR

Lucene: search engine on a document collection. Two operations:

1. Index 2. Search

Analyzer: transforms text (and queries) into tokens. For a document to appear as a result, it must contain at least one token identical to a query token.

(3)

Search: workflow

[Workflow diagram: the user query "N.A.T.O Obama" is passed through the analyzer, which produces the internal representation of the query as a list of tokens ("nato", "obama"). The search algorithm compares this list with the internal representation of each indexed document (its own list of tokens), asking for each document: is there any token identical to one in the query? Each document receives a document-query match score (e.g. 1.53, 1.04, 0.62), and documents are returned ordered by relevance (document 1, document 2, document 3).]

(4)

Quiz

Given the content of a field and a query, does the query match the field under each analyzer?

field: "the U.S.A. President Trump, yesterday, spoke at UN."   query: "Trump usa"
match? WhitespaceAnalyzer: NO   ClassicAnalyzer: YES   KeywordAnalyzer: NO

field: "the U.S.A. President Trump, yesterday, spoke at UN."   query: "the GDP shrinked"
match? WhitespaceAnalyzer: YES   ClassicAnalyzer: NO   KeywordAnalyzer: NO

field: "PC3421D"   query: "PC3421D"
match? WhitespaceAnalyzer: YES   ClassicAnalyzer: YES   KeywordAnalyzer: YES
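To check these answers, one can print the tokens that each analyzer produces for the field text and for the query, and see whether the two lists share a token. A minimal sketch, assuming the PyLucene 3.5 API used in the following slides (attribute classes differ in other Lucene versions):

import lucene
from lucene import *
lucene.initVM()

text = "the U.S.A. President Trump, yesterday, spoke at UN."
for analyzer in [WhitespaceAnalyzer(Version.LUCENE_35),
                 ClassicAnalyzer(Version.LUCENE_35),
                 KeywordAnalyzer()]:
    # tokenize the text with this analyzer and collect the produced tokens
    stream = analyzer.tokenStream("field", StringReader(text))
    term = stream.addAttribute(CharTermAttribute.class_)
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(term.toString())
    print analyzer.__class__.__name__, tokens

Running the same loop on the query text shows which token, if any, the field and the query have in common.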

(5)

Fields

● An indexed document is a set of fields

● We can perform field-specific searches

● Each field can use a different analyzer

● Examples:

Newspaper article:

● Author

● Title

● Text

● Category

● Date

Possible analyzers: WhitespaceAnalyzer, ClassicAnalyzer, KeywordAnalyzer

Item in a shop:

● Name

● Description

● Brand

● Product code

Possible analyzers: ClassicAnalyzer, KeywordAnalyzer

(6)

Aim: Fields

● Index of IMDB movies. For each movie, we extract (with XPath):

○ Title

○ Release year

○ Director

○ Synopsis

● Field-based queries

(7)

Parsing data from html

Have a look at the IMDB HTML code, and come up with XPaths to extract the information we need.

Edit html_to_data in processhtml.py:

● Extract movie title

● Extract synopsis

Make a script that prints all the information for those pages.
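One possible shape for html_to_data, as a sketch only: it assumes lxml is used for parsing, and the XPath expressions are placeholders that must be replaced with ones matching the actual IMDB markup (which changes over time).

from lxml import html

def html_to_data(page_content):
    # parse the HTML string into an element tree
    tree = html.fromstring(page_content)

    def first(xpath):
        # return the first result of an XPath query, or None if nothing matches
        nodes = tree.xpath(xpath)
        return nodes[0].strip() if nodes else None

    # placeholder XPaths: inspect the IMDB page source and adjust them
    title = first('//h1//span[@itemprop="name"]/text()')
    year = first('//h1//span[@class="nobr"]/a/text()')
    director = first('//div[@itemprop="director"]//span[@itemprop="name"]/text()')
    synopsis = first('//p[@itemprop="description"]/text()')
    return title, year, director, synopsis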

(8)

Create index: Analyzer

edit create_index.py:

import lucene
import sys
from lucene import *
from processhtml import *

lucene.initVM()

dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))

# create a distinct analyzer for each field; fields with no specific analyzer
# fall back to the default ClassicAnalyzer
analyzer = PerFieldAnalyzerWrapper(ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("synopsis", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("year", KeywordAnalyzer())  # KeywordAnalyzer takes no Version argument
analyzer.addAnalyzer("title", ClassicAnalyzer(Version.LUCENE_35))
analyzer.addAnalyzer("director", WhitespaceAnalyzer(Version.LUCENE_35))

writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength.UNLIMITED)

[...cycle over all movies...]

(9)

Create index: document

edit index.py:

[...cycle over all files...]
[...print, get title, date, synopsis, director...]

doc = Document()
doc.add(Field("title", title, Field.Store.YES, Field.Index.ANALYZED))
if date:
    # note: create_index.py registers the per-field analyzer under "year";
    # the field name used here should match it, otherwise the default analyzer applies
    doc.add(Field("date", date, Field.Store.YES, Field.Index.ANALYZED))
if synopsis:
    doc.add(Field("synopsis", synopsis, Field.Store.YES, Field.Index.ANALYZED))
if director:
    doc.add(Field("director", director, Field.Store.YES, Field.Index.ANALYZED))
writer.addDocument(doc)

# after the main loop:
print "Currently there are {0} documents in the index...".format(writer.numDocs())
writer.optimize()
writer.close()

(10)

Search by field (1)

edit search_index.py:

import lucene
import sys
from lucene import *

lucene.initVM()

# same setup as in create_index.py
dir_name = "movies_index"
index_dir = SimpleFSDirectory(File(dir_name))
searcher = IndexSearcher(index_dir)

classic_analyzer = ClassicAnalyzer(Version.LUCENE_35)
whitespace_analyzer = WhitespaceAnalyzer(Version.LUCENE_35)

title_terms = raw_input("Insert a title:")
synopsis_terms = raw_input("Insert synopsis terms:")

query = BooleanQuery()
if synopsis_terms:
    synopsis_query = QueryParser(Version.LUCENE_35, "synopsis", classic_analyzer).parse(synopsis_terms)
    query.add(synopsis_query, BooleanClause.Occur.MUST)
if title_terms:
    title_query = QueryParser(Version.LUCENE_35, "title", classic_analyzer).parse(title_terms)
    query.add(title_query, BooleanClause.Occur.MUST)

(11)

Search by field (2)

add to search_index.py:

[... follows …]

MAX = 1000

hits = searcher.search(query, MAX)

print u"Found {0} document(s) that matched query '{1}':".format(hits.totalHits, query)

for hit in hits.scoreDocs:

doc = searcher.doc(hit.doc) print "score: ", hit.score print "doc_id: ", hit.doc print "title: ", doc.get("title")

print "director: ", doc.get("director") print

(12)

Playing with Lucene

● Add other fields

● Add range queries for date (movies released in 2003-2010); see the sketch below

● Try to index a news site (BBC, CNN, ...)
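For the date-range exercise, a minimal sketch that could be added to search_index.py, assuming release dates are indexed in the "date" field as plain 4-digit year strings (so that the lexicographic comparison done by TermRangeQuery is also a chronological one):

# restrict results to movies released between 2003 and 2010 (inclusive bounds)
range_query = TermRangeQuery("date", "2003", "2010", True, True)
query.add(range_query, BooleanClause.Occur.MUST)
hits = searcher.search(query, MAX)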
