
B.4 The main methods implemented in Java

B.4.2 Methods for topic gisting

The recursive method implemented for topic gisting takes as input:

• an array of strings, text, containing the terms of the text on which topic gisting is to be performed. Note that, to use the variants referred to in this document as ”tag+concetti” or ”tag2+concetti”, the array must hold only the terms selected by a part-of-speech tagger (that code is omitted below to keep the discussion light);

• an object of type CommonsHttpSolrServer, the Solr server to be queried for the data of interest;

• an integer min, the first index of interest in the array;

• an integer max, the last index of interest in the array;

and returns a string containing the list of terms that make up the topic gist, that is, the gist of the discourse.

As briefly explained in Section 3.3.4, and with reference to the code in Listing B.2, the method performs the following operations:

Base step (lines 5-28): If the text consists of a single term, the set of concepts related to that term is returned (if the set is empty, the term itself is returned); this is done with the getContext() method, presented in more detail below. If the text consists of two terms, the set of related concepts is retrieved for each of them and the two sets are intersected with the getOverlaps() method; if the intersection turns out to be empty, the input text is returned.

Recursive step (lines 31-74): If the text consists of more than two terms, it is split into two parts and the algorithm is applied recursively to each. The results of the two subproblems are then intersected.

 1  public static String mergeGist(String[] text, CommonsHttpSolrServer server, int min, int max) throws ParseException {
 2      HashMap<String, Integer> intersection = new HashMap<String, Integer>(); // empty rather than null, so isEmpty() below is always safe
 3      String result = "";
 4
 5      // base case
 6      int size = max - min;
 7      if (size == 0 || text.length == 1) {
 8          text[min] = text[min].replaceAll("[^a-zA-Z0-9\\s-]", "");
 9          return getContext(text[min], server, 50);
10      }
11      if (size == 1 || text.length == 2) {
12          String tmp = text[min] + " " + text[max];
13          String c1 = getContext(text[min], server, 50);
14          String c2 = getContext(text[max], server, 50);
15          if (!(c1.isEmpty() && c2.isEmpty())) {
16              intersection = getOverlaps(c1, c2);
17          }
18          // if the intersection is empty, return the starting text
19          if (intersection.isEmpty()) {
20              result = tmp;
21          }
22          else {
23              result = intersection.keySet().toString();
24          }
25          result = result.replaceAll("\\[", "");
26          result = result.replaceAll("\\]", "");
27          return result;
28      } // end of base case
29
30      // compute the middle index
31      int middle = (min + max) / 2;
32
33      // save the first half of the starting concepts,
34      // to be used if the intersection is empty
35      String[] part1 = new String[middle - min + 1];
36      int j = 0;
37      for (int i = min; i <= middle; i++) {
38          part1[j] = text[i];
39          j++;
40      }
41      // call the method recursively
42      String context1 = mergeGist(text, server, min, middle);
43
44      // do the same for the second half
45      String[] part2 = new String[max - (middle + 1) + 1];
46      j = 0;
47      for (int i = middle + 1; i <= max; i++) {
48          part2[j] = text[i];
49          j++;
50      }
51      // call the method recursively
52      String context2 = mergeGist(text, server, middle + 1, max);
53
54      // if they are not empty, compute the intersection
55      if (!(context1.isEmpty() && context2.isEmpty())) {
56          intersection = getOverlaps(context1, context2);
57      }
58      // if the intersection is empty, use the concepts I
59      // started from (not the expansion...)
60      if (intersection.isEmpty()) {
61          System.out.println("concateno i risultati: "); // "concatenating the results: "
62          result = context1 + " " + context2;
63          System.out.println(result);
64      }
65      else {
66          result = intersection.keySet().toString();
67      }
68
69      // a bit of clean-up...
70      result = result.replaceFirst("\\[", "");
71      result = result.replaceFirst("\\]", "");
72
73      return result;
74  }

Listing B.2: Recursive method for topic gisting.
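The divide-and-conquer structure of Listing B.2 can be tried out without a Solr server. What follows is a minimal sketch, not the thesis implementation: it assumes a hypothetical in-memory map (CONCEPTS) in place of the Solr index, uses plain set intersection instead of getOverlaps(), and folds the two-term base case into the recursion. The class name, the method names and the sample data are invented for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergeGistSketch {
    // Hypothetical stand-in for the Solr index: term -> related concepts.
    static final Map<String, List<String>> CONCEPTS = new HashMap<>();
    static {
        CONCEPTS.put("dog", Arrays.asList("animal", "pet", "mammal"));
        CONCEPTS.put("cat", Arrays.asList("pet", "animal", "feline"));
        CONCEPTS.put("goldfish", Arrays.asList("fish", "pet"));
        CONCEPTS.put("parrot", Arrays.asList("bird", "pet"));
    }

    // Related concepts of a term, or the term itself when none are known.
    static Set<String> context(String term) {
        List<String> related = CONCEPTS.get(term);
        if (related == null || related.isEmpty()) {
            return new LinkedHashSet<>(Collections.singletonList(term));
        }
        return new LinkedHashSet<>(related);
    }

    // Gist of text[min..max], both indices inclusive.
    static Set<String> gist(String[] text, int min, int max) {
        if (min == max) {
            return context(text[min]);          // base step: a single term
        }
        int middle = (min + max) / 2;           // split point
        Set<String> left = gist(text, min, middle);
        Set<String> right = gist(text, middle + 1, max);

        Set<String> common = new LinkedHashSet<>(left);
        common.retainAll(right);                // intersection of the two halves
        if (common.isEmpty()) {
            // Empty intersection: fall back to both partial results.
            Set<String> both = new LinkedHashSet<>(left);
            both.addAll(right);
            return both;
        }
        return common;
    }

    public static void main(String[] args) {
        String[] text = {"dog", "cat", "goldfish", "parrot"};
        System.out.println(gist(text, 0, text.length - 1));
    }
}
```

With the sample data the two halves reduce to {animal, pet} and {pet}, so the program prints [pet].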

The getContext() method, which returns the set of related concepts, takes the following parameters as input:

• a string s containing the term whose related concepts are sought;

• an object of type CommonsHttpSolrServer, the Solr server on which the searches are run;

• an integer i, the maximum cardinality of the set of related concepts;

and returns the list of related concepts as a string, one concept per line.

Listing B.3 shows the code of the method that returns the concepts related to a given input term. The method performs the following operations:

• build a Solr query for the input term and process it (lines 2-13);

• read the content of the response, keeping only relations of type /r/IsA and /r/HasProperty (lines 16-38);

• save the extracted concepts in a result string (lines 39-45).

 1  public static String getContext(String s, CommonsHttpSolrServer server, int i) {
 2      try {
 3          // create the query and set the number of results
 4          SolrQuery query = new SolrQuery();
 5          query.setQuery(s);
 6          query.setStart(0);
 7          query.setRows(i);
 8          // set the fields of interest
 9          query.setFields("text", "rel");
10
11          // process the query
12          QueryRequest qryReq = new QueryRequest(query);
13          QueryResponse response = qryReq.process(server);
14
15          // save the results in an output string, removing duplicates
16          String r = "";
17          Set<String> set = new LinkedHashSet<String>();
18          set.clear();
19          long maxNum = response.getResults().getNumFound();
20          if (maxNum < i) {
21              i = (int) maxNum;
22              System.out.println("Sono presenti meno concetti collegati..."); // "Fewer related concepts are available..."
23          }
24          for (int t = 0; t < i; t++) {
25              String rel = response.getResults().get(t).getFieldValue("rel").toString();
26              if (rel.equals("/r/IsA") || rel.equals("/r/HasProperty")) {
27                  Object text = response.getResults().get(t).getFieldValue("text");
28                  String str = text.toString();
29                  String delims = "[\\[,\\]]+";
30                  String[] list = str.split(delims);
31                  for (int x = 0; x < list.length; x++) {
32                      list[x] = list[x].trim(); // strip the spaces at the start of the words
33                      System.out.println(">" + list[x]);
34                  }
35                  List<String> l = Arrays.asList(list);
36                  set.addAll(l); // remove duplicates while preserving order
37              }
38          }
39          String[] result = new String[set.size()];
40          set.toArray(result);
41          for (String tmp : result) {
42              r += tmp + "\n";
43          }
44          return r;
45      }
46      catch (Error e) {
47          return "";
48      }
49      catch (Exception e) {
50          System.out.println("Impossibile trovare il concetto"); // "Unable to find the concept"
51          return "";
52      }
53  }

Listing B.3: Method for extracting the related concepts.
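The filtering and de-duplication steps of Listing B.3 can be isolated from the Solr call. The sketch below is an illustration under stated assumptions rather than the thesis code: the Edge class is a hypothetical stand-in for a Solr result document carrying the "rel" and "text" fields, the sample edges are invented, and empty fragments produced by the split are skipped (the listing does not filter them).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ContextFilterSketch {
    // Hypothetical stand-in for one Solr result with its "rel" and "text" fields.
    static final class Edge {
        final String rel;
        final String text;
        Edge(String rel, String text) {
            this.rel = rel;
            this.text = text;
        }
    }

    // Keeps only /r/IsA and /r/HasProperty edges among the first `limit` results
    // and returns their concepts, one per line, without duplicates.
    static String relatedConcepts(List<Edge> edges, int limit) {
        Set<String> concepts = new LinkedHashSet<>(); // preserves insertion order
        int n = Math.min(limit, edges.size());
        for (int t = 0; t < n; t++) {
            Edge e = edges.get(t);
            if (e.rel.equals("/r/IsA") || e.rel.equals("/r/HasProperty")) {
                // Same delimiter pattern as the listing: brackets and commas.
                for (String piece : e.text.split("[\\[,\\]]+")) {
                    String concept = piece.trim();
                    if (!concept.isEmpty()) {
                        concepts.add(concept);
                    }
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (String c : concepts) {
            out.append(c).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("/r/IsA", "[dog, animal]"));
        edges.add(new Edge("/r/UsedFor", "[dog, guard]"));      // discarded: other relation
        edges.add(new Edge("/r/HasProperty", "[dog, loyal]"));
        System.out.print(relatedConcepts(edges, 50));
    }
}
```

A LinkedHashSet is used, as in the listing, because it drops duplicates while keeping the order in which the concepts were returned.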

The getOverlaps() method, which intersects two strings and returns the words they share, takes as input:

• two strings: string0 and string1;

and returns a data structure of type HashMap<String, Integer> holding the words common to the two strings together with the number of times each one appears in both strings.
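To make the shape of the returned map concrete, here is a deliberately simplified sketch. It is not the algorithm of getOverlaps(): it only counts shared single words, ignores multi-word overlaps, and counts occurrences in the first string only. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class WordOverlapSketch {
    // Simplified overlap: each word of `a` that also occurs in `b`,
    // mapped to the number of times it occurs in `a`.
    static Map<String, Integer> sharedWords(String a, String b) {
        Set<String> wordsOfB = new HashSet<>(Arrays.asList(b.trim().split("\\s+")));
        Map<String, Integer> shared = new LinkedHashMap<>();
        for (String w : a.trim().split("\\s+")) {
            if (!w.isEmpty() && wordsOfB.contains(w)) {
                shared.merge(w, 1, Integer::sum);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(sharedWords("pet animal mammal", "feline pet animal"));
    }
}
```

For the two sample strings the program prints {pet=1, animal=1}.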

This method is a Java reimplementation of a Perl function of the same name by Jason Michelizzi, Ted Pedersen, Siddharth Patwardhan, Satanjeev Banerjee and Ying Liu [44]. The code in Listing B.4 keeps the original comments (in English).

 1  public static HashMap<String, Integer> getOverlaps(String string0, String string1) {
 2
 3      // Results are a HashMap with keys that are the overlapping strings and
 4      // values that are the frequency counts of those overlaps
 5      HashMap<String, Integer> overlapsHash = new HashMap<String, Integer>();
 6      int matchStartIndex = 0;
 7      int currIndex = -1;
 8
 9      // Strings are compared based on word overlaps, so create arrays of
10      // words
11      String[] S = null, W = null;
12
13      // assign the shortest string to variable S
14      if (string0.length() > string1.length()) {
15          W = string0.trim().split("\\s+");
16          S = string1.trim().split("\\s+");
17      }
18      else {
19          S = string0.trim().split("\\s+");
20          W = string1.trim().split("\\s+");
21      }
22      int[] overlapsLengths = new int[W.length + S.length];
23
24      while (currIndex < S.length - 1) {
25          currIndex++;
26          if (contains(W, Arrays.copyOfRange(S, matchStartIndex, currIndex + 1), false)) {
27              continue;
28          }
29          else {
30              overlapsLengths[matchStartIndex] = currIndex - matchStartIndex;
31              if (overlapsLengths[matchStartIndex] > 0) {
32                  currIndex--;
33              }
34              matchStartIndex++;
35          }
36      }
37      for (int i = matchStartIndex; i <= currIndex; i++) {
38          overlapsLengths[i] = currIndex - i + 1;
39      }
40      int longestOverlap = 0;
41      for (int i = 0; i < overlapsLengths.length; i++) {
42          if (overlapsLengths[i] > longestOverlap) {
43              longestOverlap = overlapsLengths[i];
44          }
45      }
46
47      while (longestOverlap > 0) {
48          for (int i = 0; i <= overlapsLengths.length - 1; i++) {
49              if (overlapsLengths[i] < longestOverlap) {
50                  continue;
51              }
52              int stringEnd = i + longestOverlap;
53              if (contains(W, Arrays.copyOfRange(S, i, stringEnd), true)) {
54                  String temp = new String(S[i]);
55                  for (int j = i + 1; j < stringEnd; j++) {
56                      temp += " " + S[j];
57                  }
58
59                  if (overlapsHash.containsKey(temp)) {
60                      overlapsHash.put(temp.toString(), overlapsHash.get(temp) + 1);
61                  }
62                  else {
63                      overlapsHash.put(temp, 1);
64                  }
65
66                  // adjust overlap lengths forward
67                  for (int j = i; j < i + longestOverlap; j++) {
68                      overlapsLengths[j] = 0;
69                  }
70
71                  // adjust overlap lengths backward
72                  for (int j = i - 1; j >= 0; j--) {
73                      if (overlapsLengths[j] <= i - j)
74                          break;
75                      overlapsLengths[j] = i - j;
76                  }
77              }
78              else {
79                  int k = longestOverlap - 1;
80                  while (k > 0) {
81                      stringEnd = i + k - 1;
82                      if (contains(W, Arrays.copyOfRange(S, i, stringEnd), false))
83                          break;
84                      k--;
85                  }
86                  overlapsLengths[i] = k;
87              }
88          }
89          longestOverlap = Collections.max(Arrays.asList(longestOverlap)) - 1;
90      }
91      return overlapsHash;
92  }
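The listing calls a helper, contains(), that is not reproduced in this appendix. A plausible reading, and it is only an assumption, is that it tests whether a sequence of words occurs contiguously inside another word array. The sketch below implements that reading under a different, hypothetical name (containsSequence); the meaning of the boolean third argument used in the listing is not documented here, so the sketch leaves it out.

```java
class ContainsSketch {
    // True when the words of `phrase` occur in `words` in the same order
    // and with no other word in between. An empty phrase always matches.
    static boolean containsSequence(String[] words, String[] phrase) {
        if (phrase.length == 0) {
            return true;
        }
        for (int start = 0; start + phrase.length <= words.length; start++) {
            boolean match = true;
            for (int k = 0; k < phrase.length; k++) {
                if (!words[start + k].equals(phrase[k])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] w = "a quick brown dog".split("\\s+");
        System.out.println(containsSequence(w, new String[] {"quick", "brown"}));
        System.out.println(containsSequence(w, new String[] {"brown", "quick"}));
    }
}
```

The first call finds the phrase and prints true; the second prints false because the two words appear in the opposite order.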

[1] Lemur project, 2012. www.lemurproject.org/.

[2] Banerjee, S., and Pedersen, T. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th international joint conference on Artificial intelligence (San Francisco, CA, USA, 2003), IJCAI’03, Morgan Kaufmann Publishers Inc., pp. 805–810.

[3] Bendersky, M., and Croft, W. B. Discovering key concepts in verbose queries. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2008), SIGIR ’08, ACM, pp. 491–498.

[4] Bendersky, M., and Croft, W. B. Analysis of long queries in a large scale search log. In Proceedings of the 2009 workshop on Web Search Click Data (New York, NY, USA, 2009), WSCD ’09, ACM, pp. 8–14. http://videolectures.net/wscd09_bendersky_alqlssl/.

[5] Bendersky, M., Metzler, D., and Croft, W. B. Parameterized concept weighting in verbose queries. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (New York, NY, USA, 2011), SIGIR ’11, ACM, pp. 605–614.

[6] Berger, A., and Lafferty, J. Information retrieval as statistical translation. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 1999), SIGIR ’99, ACM, pp. 222–229.

[7] Broder, A. A taxonomy of web search. SIGIR Forum 36, 2 (Sept. 2002), 3–10.

[8] Conceptnet 4, 2012. http://csc.media.mit.edu/docs/conceptnet/webapi_client.html.

[9] Croft, B., Metzler, D., and Strohman, T. Search Engines: Information Retrieval in Practice, 1st ed. Addison-Wesley Publishing Company, USA, 2009.

[10] Croft, W. B. Unsolved problems in search: (and how we approach them). In Proceedings of the 17th ACM conference on Information and knowledge management (New York, NY, USA, 2008), CIKM ’08, ACM, pp. 1001–1001. http://videolectures.net/cikm08_croft_upis/.

[11] Croft, W. B. Query evolution. In Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval (Berlin, Heidelberg, 2009), ECIR ’09, Springer-Verlag, pp. 1–1.

[12] Dbpedia, 2012. dbpedia.org/About.

[13] de Marneffe, M.-C., and Manning, C. D. The stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation (Stroudsburg, PA, USA, 2008), CrossParser ’08, Association for Computational Linguistics, pp. 1–8.

[14] Eagle, N., Singh, P., and Pentland, A. Common sense conversations: understanding casual conversation using a common sense database. In Proceedings of the Artificial Intelligence, Information Access, and Mobile Computing Workshop (2003).

[15] Echihabi, A., and Marcu, D. A noisy-channel approach to question answering. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1 (Stroudsburg, PA, USA, 2003), ACL ’03, Association for Computational Linguistics, pp. 16–23.

[16] Edmundson, H. P. New methods in automatic extracting. J. ACM 16, 2 (Apr. 1969), 264–285.

[17] Fundel, K., Küffner, R., and Zimmer, R. Relex - relation extraction using dependency parse trees. Bioinformatics 23, 3 (Jan. 2007), 365–371.

[18] Guess, A. Bing launches adaptive search, September 2011. http://semanticweb.com/bing-launches-adaptive-search_b23175.

[19] Guo, J., Xu, G., Li, H., and Cheng, X. A unified and discriminative model for query refinement. In SIGIR ’08 (2008), pp. 379–386.

[20] Hauff, C., Hiemstra, D., and de Jong, F. A survey of pre-retrieval query performance predictors. CIKM ’08, Napa Valley, California, USA (2008).

[21] He, B., and Ounis, J. Inferring query performance using pre-retrieval predictors. The Eleventh Symposium on String Processing and Information Retrieval (2004).

[22] Hiemstra, D., and de Jong, F. Disambiguation strategies for cross-language information retrieval. In Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, ECDL ’99 (1999).

[23] Hoenkamp, E., Bruza, P., Song, D., and Huang, Q. An effective approach to verbose queries using a limited dependencies language model. In Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory (Berlin, Heidelberg, 2009), ICTIR ’09, Springer-Verlag, pp. 116–127.

[24] Hovy, E., and Lin, C.-Y. Automated text summarization and the summarist system. In Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998 (Stroudsburg, PA, USA, 1998), TIPSTER ’98, Association for Computational Linguistics, pp. 197–214.

[25] Jurafsky, D., and Martin, J. H. Speech and Language Processing. Prentice Hall, 2000.

[26] Kuc, R. Apache Solr 3.1 Cookbook. Packt Publishing, 2011.

[27] Kumaran, G., and Carvalho, V. R. Reducing long queries using query quality predictors. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2009), SIGIR ’09, ACM, pp. 564–571.

[28] Lafferty, J. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Morgan Kaufmann, pp. 282–289.

[29] Lau, T., and Horvitz, E. Patterns of search: Analyzing and modeling web query refinement. In Proceedings of the Seventh International Conference on User Modeling, Banff, Canada (1998), Springer Wien, pp. 119–128.

[30] Lavrenko, V. A Generative Theory of Relevance. Springer Publishing Company, Incorporated, 2010.

[31] Lavrenko, V., and Croft, W. B. Relevance Models in Information Retrieval. Kluwer Academic Publishers, 2003, ch. 2.

[32] Lease, M. Improved markov random field model for supporting verbose queries. SIGIR (2009).

[33] Lehnert, W. G. Computational models of natural language processing. Elsevier North-Holland, Inc., New York, NY, USA, 1984, ch. Narrative complexity based on summarization algorithms, pp. 247–259.

[34] Lesk, M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (New York, NY, USA, 1986), SIGDOC ’86, ACM, pp. 24–26.

[35] Littman, M., Dumais, S., and Landauer, T. Automatic cross-language retrieval using latent semantic indexing. In Cross-Language Retrieval (1998).

[36] Apache lucene. http://lucene.apache.org/.

[37] Luhn, H. P. The automatic creation of literature abstracts. IBM Journal of Research and Development 2, 2 (Apr. 1959), 159–165.

[38] Markoff, J. A software secretary that takes charge, December 2008. http://www.nytimes.com/2008/12/14/business/14stream.html.

[39] Mauldin, M. L. Retrieval performance in ferret: A conceptual information retrieval system. In Proceedings of the 14th International Conference on Research and Development in Information Retrieval, Chicago (1991).

[40] McKeown, K. R., and Radev, D. R. Generating summaries of multiple news articles. In Proceedings, 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1995).

[41] Melucci, M. Dispensa: Lezioncine sui Motori di Ricerca. Teoria e architetture per Information Retrieval e Machine Learning, 2011.

[42] Metzler, D., and Croft, W. B. A markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2005), SIGIR ’05, ACM, pp. 472–479.

[43] Metzler, D., and Croft, W. B. Latent concept expansion using markov random fields. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007), SIGIR ’07, ACM, pp. 311–318.

[44] Michelizzi, J., Pedersen, T., Patwardhan, S., Banerjee, S., and Liu, Y. Overlapfinder - find overlapping words in strings, 2010. http://cpan.uwinnipeg.ca/htdocs/Text-Similarity/Text/OverlapFinder.pm.html.

[45] NIST, T. Text retrieval conference (trec) data - english relevance judgements, 2006. http://trec.nist.gov/data/reljudge_eng.html.

[46] Park, J. H., and Croft, W. B. Query term ranking based on dependency parsing of verbose queries. Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (2010), 829–830.

[47] Phan, N., Bailey, P., and Wilkinson, R. Understanding the relationship of information need specificity to search query length. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007), SIGIR ’07, ACM, pp. 709–710.

[48] Ponte, J. M., and Croft, W. B. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 1998), SIGIR ’98, ACM, pp. 275–281.

[49] Riezler, S., Vasserman, A., Tsochantaridis, I., Mittal, V., and Liu, Y. Statistical machine translation for query expansion in answer retrieval. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL’07) (Prague, Czech Republic, 2007).

[50] Robertson. The probability ranking principle in ir. Journal of Documentation (1977), 33:294–304. Reprinted in (Sparck Jones and Willett, 1997).

[51] Robertson, and Jones, S. Relevance weighting of search terms. Journal of the American Society for Information Science (1976), 27:129–146. Reprinted in (Willett, 1988).

[52] Robertson, S. E., and Walker, S. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. Proc. of SIGIR (1994), 232–241.

[53] Salton, G., and Buckley, C. Improving retrieval performance by relevance feedback. Tech. rep., Ithaca, NY, USA, 1988.

[54] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Tech. rep., Ithaca, NY, USA, 1974.

[55] Solr tutorial. http://lucene.apache.org/solr/tutorial.html.

[56] Speer, R., and Havasi, C. Conceptnet 5, 2012. http://conceptnet5.media.mit.edu/.

[57] Speer, R., and Havasi, C. Representing general relational knowledge in conceptnet 5. In Proceedings of LREC (2012).

[58] Tan, P.-N., Steinbach, M., and Kumar, V. Introduction to Data Mining. Pearson International Edition, 2006.

[59] Tomlinson, S. Lexical and algorithmic stemming compared for 9 european languages with hummingbird searchserver at clef 2003. In Cross-Language Evaluation Forum (2003).

[60] Solrj. http://wiki.apache.org/solr/Solrj.

[61] Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the lesk approach for disambiguating words. In Proceedings of Language Resources and Evaluation (LREC 2004) (Lisbonne, Portugal, 2004), pp. 633–636.

[62] Wang, X., and Zhai, C. Mining term association patterns from search logs for effective query reformulation. In Proceedings of the 17th ACM conference on Information and knowledge management (New York, NY, USA, 2008), CIKM ’08, ACM, pp. 479–488.

[63] Wiktionary. http://www.wiktionary.org/.

[64] About wordnet, 2012. http://wordnet.princeton.edu/.

[65] Xue, X., Huston, S., and Croft, W. B. Improving verbose queries using subset distribution. In Proceedings of the 19th ACM international conference on Information and knowledge management (New York, NY, USA, 2010), CIKM ’10, ACM, pp. 1059–1068.

[66] Xue, X., Jeon, J., and Croft, W. B. Retrieval models for question and answer archives. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2008), SIGIR ’08, ACM, pp. 475–482.

I would like to thank Prof. Massimo Melucci, supervisor of this thesis, for his constant courtesy and for the valuable guidance he gave me over these months. I also wish to thank Dr. Emanuele di Buccio for his availability and helpful suggestions. I further thank NIST for the documents of the TREC Robust 2004 collection.

The following people served as judges for this thesis: Daniele Barilaro, Alessando Di Pieri, Andrea Marcato, Giulia Moro, Lorena Moro, Luca Pellegrini, Rossella Petrucci, Lucia Petterle; they deserve special thanks for the time they devoted to this task and the care they put into it.

Finally, I affectionately thank my family, who have always supported and encouraged me.
