Information Retrieval (Lab Test) 01 February 2017

Academic year: 2021

You are given the XML file at http://tinyurl.com/ir-lab-articles, which contains a list of 40,000 scientific articles, each delimited by the tags <article>…</article>. The goal is to:

1. Create a Lucene index with one document per article, each including three fields:

◦ The author name (XML element <author>)

◦ The article title (XML element <title>)

◦ The publication year (XML element <year>)

Notes:

◦ You can assume that each <article> element has exactly one title, one year and at least one author. In case of multiple authors, their names must be concatenated into a single field.

◦ Author names and year must be tokenized with a WhitespaceAnalyzer, and the title must be tokenized with the ClassicAnalyzer.

2. Build a search engine that lets users search articles via a command line interface, in which the user specifies some keywords for the author name and a range of years, entered as a “starting year” and an “ending year” (extremes included). The search engine must support the following query: the author names must be matched by the corresponding keywords, and the publishing year must fall within the specified range. Hint: this can be implemented by generating and issuing a sequence of queries, one per year of the queried range, each obtained by converting that year into a string to be matched exactly against the content of the field <year>. The resulting docIDs and their scores have to be concatenated and then sorted by score. The top-10 results have to be printed, along with all their fields and their ranking score.
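The merging step described in the hint can be sketched independently of Lucene. The snippet below is only an illustration: `search_year` is a hypothetical callable standing in for whatever function issues a single-year query against the index and returns (docID, score) pairs, and `FAKE_INDEX` is made-up data used to exercise the logic. Since each article has exactly one year, a docID cannot appear in the results of two different years, so no de-duplication is attempted.

```python
def merge_year_results(search_year, start_year, end_year, top_k=10):
    """Run one query per year in [start_year, end_year] and keep the best hits.

    search_year is a hypothetical callable: given a year as a string, it
    returns a list of (doc_id, score) pairs for that single-year query.
    """
    hits = []
    for year in range(start_year, end_year + 1):  # both extremes included
        hits.extend(search_year(str(year)))
    # Sort the concatenated hits by score, highest first, and keep the top_k.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_k]


# Made-up per-year results, used only to exercise the merging logic.
FAKE_INDEX = {
    "2009": [(3, 0.4)],
    "2010": [(7, 1.2), (1, 0.9)],
    "2011": [(5, 1.0)],
    "2012": [(8, 2.0)],
}

top = merge_year_results(lambda y: FAKE_INDEX.get(y, []), 2010, 2011)
print(top)  # [(7, 1.2), (5, 1.0), (1, 0.9)]
```

In the real search script, the lambda would be replaced by a function that combines the author keywords with an exact match on the year field and returns the hits reported by the index searcher.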

Tips:

• We suggest the following procedure:

1. Build a function that processes the XML file and gets the field values

2. Test this function to make sure the fields are correctly extracted

3. Create a script to build the index

4. Create a script to search the index and run a few test queries
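For step 4, the command line front end of the search script might look like the sketch below. The option names `--from-year` and `--to-year`, and the keyword `rossi`, are assumptions chosen for illustration; the test text only requires that the user can enter author keywords, a starting year and an ending year.

```python
import argparse


def parse_args(argv=None):
    """Parse author keywords and an inclusive year range from the command line."""
    parser = argparse.ArgumentParser(
        description="Search articles by author and year range")
    parser.add_argument("author", nargs="+",
                        help="keywords to match in the author field")
    parser.add_argument("--from-year", type=int, required=True,
                        dest="start_year", help="starting year (included)")
    parser.add_argument("--to-year", type=int, required=True,
                        dest="end_year", help="ending year (included)")
    args = parser.parse_args(argv)
    if args.start_year > args.end_year:
        parser.error("starting year must not be later than ending year")
    return args


# Example invocation with a made-up keyword and range.
args = parse_args(["rossi", "--from-year", "1995", "--to-year", "1997"])
# Each year becomes a string to be matched exactly against the year field.
years = [str(y) for y in range(args.start_year, args.end_year + 1)]
print(args.author, years)
```

Passing an explicit argument list to `parse_args` makes it easy to run a few test queries from a script; calling it with no argument reads the real command line.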

• Relative XPath queries can be run on single nodes, e.g. article_node.xpath("./author/text()") will get the list of the text content of all direct children of article_node having the tag "author".
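A minimal sketch of the extraction function from step 1 follows. To stay self-contained it uses the standard-library `xml.etree.ElementTree`, whose `findall("./author")` plays the role of the relative XPath query above; this is a swapped-in stand-in, not the `.xpath(...)` API shown in the tip. The enclosing `<articles>` root element and the sample record are made up for illustration, since the test text does not show the actual file layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real file is assumed to hold many <article> elements.
SAMPLE_XML = """<articles>
  <article>
    <author>A. Rossi</author>
    <author>B. Bianchi</author>
    <title>A sample title</title>
    <year>2010</year>
  </article>
</articles>"""


def extract_fields(article_node):
    """Return the author, title and year values for one <article> element."""
    # "./author" selects only the direct children tagged <author>.
    authors = [(a.text or "").strip() for a in article_node.findall("./author")]
    return {
        "author": " ".join(authors),  # multiple authors go into a single field
        "title": (article_node.findtext("./title") or "").strip(),
        "year": (article_node.findtext("./year") or "").strip(),
    }


root = ET.fromstring(SAMPLE_XML)
records = [extract_fields(node) for node in root.iter("article")]
print(records[0])
```

Printing a few records like this is one way to carry out step 2 before any index is built.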
