9 Relevance feedback and query expansion
9.1 Relevance feedback and pseudo relevance feedback
9.1.1 The Rocchio algorithm for relevance feedback
The Rocchio algorithm is the classic algorithm for implementing RF. It models a way of incorporating relevance feedback information into the vector space model of Section 6.3.
The underlying theory. We want to find a query vector, denoted q, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents. If C_r is the set of relevant documents and C_nr
Figure 9.1 RF searching over images. (a) The user views the initial query results for a query of bike, selects the first, third, and fourth results in the top row and the fourth result in the bottom row as relevant, and submits this feedback. (b) The user sees the revised result set. Precision is greatly improved. From http://nayana.ece.ucsb.edu/imsearch/imsearch.html (Newsam et al. 2001).
is the set of nonrelevant documents, then we wish to find:
q_opt = arg max_q [sim(q, C_r) − sim(q, C_nr)],    (9.1)
where sim is defined as in Equation (6.10). Under cosine similarity, the optimal query vector q_opt for separating the relevant and nonrelevant documents is:

q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) Σ_{d_j ∈ C_nr} d_j    (9.2)
(Footnote: in the equation, arg max_x f(x) returns a value of x that maximizes the value of the function f(x). Similarly, arg min_x f(x) returns a value of x that minimizes the value of the function f(x).)
Figure 9.2 Example of relevance feedback on a text collection. (a) The initial query. (b) The user marks some relevant documents (shown with a plus sign). (c) The query is then expanded by 18 terms with weights as shown. (d) The revised top results are then shown. A * marks the documents which were judged relevant in the relevance feedback phase.

(a) Query: new space satellite applications

(b) Initial results:
+ 1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
+ 2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
  3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

(c) Expanded query terms and weights:
 2.074 new           15.106 space
30.816 satellite      5.660 application
 5.991 nasa           5.196 eos
 4.196 launch         3.972 aster
 3.516 instrument     3.446 arianespace
 3.004 bundespost     2.806 ss
 2.790 rocket         2.053 scientist
 2.003 broadcast      1.172 earth
 0.836 oil            0.646 measure

(d) Revised results:
* 1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
* 2. 0.500, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
  3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4. 0.493, 07/31/89, NASA Uses 'Warm' Superconductors For Fast Circuit
* 5. 0.492, 12/02/87, Telecommunications Tale of Two Companies
  6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million
Figure 9.3 The Rocchio optimal query for separating relevant and nonrelevant documents.
That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents (Figure 9.3). However, this observation is not terribly useful, precisely because the full set of relevant documents is not known; it is what we want to find.
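As a concrete illustration, the centroid difference of Figure 9.3 can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not code from the text: vectors are represented as term → weight dicts, and the function names are my own.

```python
def centroid(docs):
    """Component-wise mean of a nonempty list of term -> weight dicts."""
    terms = set().union(*docs)
    return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

def optimal_query(relevant, nonrelevant):
    """Difference of the relevant and nonrelevant centroids, as in
    Figure 9.3. Assumes both document sets are fully known and nonempty,
    which (as the text notes) never holds in practice."""
    cr, cnr = centroid(relevant), centroid(nonrelevant)
    return {t: cr.get(t, 0.0) - cnr.get(t, 0.0) for t in set(cr) | set(cnr)}
```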
The Rocchio (1971) algorithm. The Rocchio algorithm was the relevance feedback mechanism introduced in and popularized by Salton's SMART system around 1970. In a real IR query context, we have a user query and partial knowledge of known relevant and nonrelevant documents. The algorithm proposes using the modified query q_m:
q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j    (9.3)
where q_0 is the original query vector; D_r and D_nr are the sets of known relevant and nonrelevant documents, respectively; and α, β, and γ are weights attached to each term. These control the balance between trusting the judged document set versus the query: if we have a lot of judged documents, we would like a higher β and γ. Starting from q_0, the new query moves some distance toward the centroid of the relevant documents and some distance away from the centroid of the nonrelevant documents. This new query can be used for retrieval in the standard vector space model (see Section 6.3).
Subtracting off a nonrelevant document's vector can easily take the query outside the positive quadrant of the vector space, so in the Rocchio algorithm negative term weights are ignored; that is, they are set to 0. Figure 9.4 shows the effect of applying relevance feedback.
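A minimal sketch of the Rocchio update in Equation (9.3), including the clipping of negative term weights to zero, might look as follows. Vectors are term → weight dicts, and the function name and defaults are illustrative assumptions rather than anything prescribed by the text.

```python
def rocchio_update(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback, Equation (9.3): move the query toward
    the centroid of the judged relevant documents and away from the
    centroid of the judged nonrelevant ones. Negative term weights are
    set to 0, as described in the text."""
    terms = set(q0)
    for d in relevant + nonrelevant:
        terms |= set(d)
    qm = {}
    for t in terms:
        w = alpha * q0.get(t, 0.0)
        if relevant:       # centroid of judged relevant documents
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:    # centroid of judged nonrelevant documents
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        qm[t] = max(w, 0.0)  # ignore negative weights
    return qm
```

The guards on empty sets mean the same function covers positive-only feedback (no nonrelevant judgments, equivalent to γ = 0).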
Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the
Figure 9.4 An application of Rocchio's algorithm. Some documents have been labeled as relevant and nonrelevant and the initial query vector is moved in response to this feedback.
query, but it is also partly an effect of the use case: when they want high recall, users can be expected to take time to review results and to iterate on the search. Positive feedback also turns out to be much more valuable than negative feedback, and so most IR systems set γ < β. Reasonable values might be α = 1, β = 0.75, and γ = 0.15. In fact, many systems, such as the image search system in Figure 9.1, allow only positive feedback, which is equivalent to setting γ = 0. Another alternative is to use only the marked nonrelevant document that received the highest ranking from the IR system as negative feedback (here, |D_nr| = 1 in Equation (9.3)). Although many of the experimental results comparing various relevance feedback variants are rather inconclusive, some studies have suggested that this variant, called Ide dec-hi, is the most effective, or at least the most consistent, performer.
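One common formulation of Ide dec-hi drops the 1/|D_r| and 1/|D_nr| normalizations of Equation (9.3) and subtracts only the single highest-ranked nonrelevant document. The sketch below assumes that formulation; exact weightings vary across descriptions, so treat the details as assumptions.

```python
def ide_dec_hi(q0, relevant, top_nonrelevant):
    """Ide dec-hi variant (one common formulation): add the judged
    relevant document vectors without normalization, subtract only the
    highest-ranked judged nonrelevant document, and clip negative
    weights to 0 as in Rocchio."""
    terms = set(q0) | set(top_nonrelevant)
    for d in relevant:
        terms |= set(d)
    qm = {}
    for t in terms:
        w = (q0.get(t, 0.0)
             + sum(d.get(t, 0.0) for d in relevant)
             - top_nonrelevant.get(t, 0.0))
        qm[t] = max(w, 0.0)
    return qm
```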
Exercise 9.1 Under what conditions would the modified query q_m in Equation (9.3) be the same as the original query q_0? In all other cases, is q_m closer than q_0 to the centroid of the relevant documents?

Exercise 9.2 Why is positive feedback likely to be more useful than negative feedback to an IR system? Why might only using one nonrelevant document be more effective than using several?
Exercise 9.3 Suppose that a user's initial query is cheap CDs cheap DVDs extremely cheap CDs. The user examines two documents, d_1 and d_2. She judges d_1, with the content CDs cheap software cheap CDs, relevant and d_2, with content cheap thrills DVDs, nonrelevant. Assume that we are using direct term frequency (with no scaling and no document frequency). There is no need to length-normalize vectors. Using Rocchio relevance feedback as in Equation (9.3), what would the revised query vector be after relevance feedback? Assume α = 1, β = 0.75, γ = 0.25.
Exercise 9.4 [ ] Omar has implemented an RF web search system, where he is going to do RF based only on words in the title text returned for a page (for efficiency). The user is going to rank three results. The first user, Jinxing, queries for:
banana slug
and the top three titles returned are:
banana slug Ariolimax columbianus
Santa Cruz mountains banana slug
Santa Cruz Campus Mascot
Jinxing judges the first two documents relevant, and the third nonrelevant.
Assume that Omar's search engine uses term frequency but no length normalization or IDF. Assume that he is using the Rocchio RF mechanism, with α = β = γ = 1. Show the final revised query that would be run.
(Please list the vector elements in alphabetical order.)
9.1.2 Probabilistic relevance feedback
Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier. One way of doing this is with a Naive Bayes probabilistic model.
If R is a Boolean indicator variable expressing the relevance of a document, then we can estimate P(x_t = 1), the probability of a term t appearing in a document, depending on whether it is relevant or not, as:
P̂(x_t = 1 | R = 1) = |VR_t| / |VR|    (9.4)
P̂(x_t = 1 | R = 0) = (df_t − |VR_t|) / (N − |VR|)
where N is the total number of documents, df_t is the number that contain t, VR is the set of known relevant documents, and VR_t is the subset of this set containing t. Even though the set of known relevant documents is perhaps a small subset of the true set of relevant documents, if we assume that the set of relevant documents is a small subset of the set of all documents, then the estimates given above will be reasonable. This gives a basis for another way of changing the query term weights. We discuss such probabilistic approaches more in Chapters 11 and 13, and in particular outline the application to relevance feedback in Section 11.3.4 (page 209). For the moment, observe that using just Equation (9.4) as a basis for term weighting is likely insufficient. The equations use only collection statistics and information about the term distribution within the documents judged relevant; they preserve no memory of the original query.