Wednesday, February 12, 2014

Reading notes for Unit 6


  • To measure ad hoc information retrieval effectiveness, we need a test collection consisting of:
      1. A document collection
      2. A test suite of information needs
      3. A set of relevance judgments

  • A document is relevant if it addresses the stated information need, not merely because it happens to contain all the words in the query.
Evaluation of unranked retrieval sets

  • The two most frequent and basic measures for information retrieval effectiveness are precision and recall.
  • Accuracy is not an appropriate measure for information retrieval problems. A system tuned to maximize accuracy can appear to perform well simply by deeming all documents non-relevant to all queries, since relevant documents are usually a tiny fraction of the collection. But labeling all documents non-relevant is completely unsatisfying to an information retrieval system user.
  • A single measure that trades off precision versus recall is the F measure.
  • We use a harmonic mean rather than the simpler average (arithmetic mean): the harmonic mean is always closer to the smaller of the two values, so a system cannot achieve a high F score by maximizing one measure at the expense of the other.
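The precision/recall trade-off and the harmonic-mean F measure above can be sketched in a few lines of Python. This is a minimal illustration: representing judgments and results as sets of invented document IDs is an assumption of the sketch.

```python
# Relevance judgments and retrieved results represented as sets of doc IDs
# (an assumption of this sketch; the IDs are invented).

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_measure(retrieved, relevant, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives the
    balanced F1, beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # ≈ 0.667
print(f_measure(retrieved, relevant))  # ≈ 0.571
```

Note that returning every document in the collection drives recall to 1.0 but precision toward zero; the harmonic mean keeps F low in that case, whereas the arithmetic mean would still exceed 0.5.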
Evaluation of ranked retrieval results

  • Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this is the 11-point interpolated average precision, where interpolated precision is measured at the 11 recall levels 0.0, 0.1, ..., 1.0.
  • In web search, what matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents ("precision at k").
  • Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels: for each query, precision is averaged over the ranks at which relevant documents are retrieved, and these per-query values are then averaged over the query set.
  • An alternative is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
  • Another concept sometimes used in evaluation is the ROC curve. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP.
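The ranked measures above (precision at k, average precision and MAP, R-precision) can be sketched as follows. The rankings and relevance judgments are invented for illustration; real evaluations would use a test collection.

```python
# Each ranking is a list of doc IDs in retrieval order; judgments are sets.

def precision_at_k(ranking, relevant, k):
    """Precision over the top k retrieved documents (e.g. P@10)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant
    document; relevant documents never retrieved contribute zero."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments):
    """MAP: average precision, averaged over all queries in the test suite."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, judgments)]
    return sum(aps) / len(aps)

def r_precision(ranking, relevant):
    """Precision at rank |Rel|, the size of the known relevant set."""
    return precision_at_k(ranking, relevant, len(relevant))

# Invented rankings for two queries, with their relevance judgments.
# Note d6 is relevant to the first query but never retrieved, which
# average_precision correctly counts against the system.
rankings = [["d1", "d2", "d3", "d4", "d5"], ["d7", "d6", "d8"]]
judgments = [{"d1", "d3", "d6"}, {"d6"}]
print(precision_at_k(rankings[0], judgments[0], 5))   # 0.4
print(r_precision(rankings[0], judgments[0]))         # ≈ 0.667
print(mean_average_precision(rankings, judgments))    # ≈ 0.528
```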
Assessing relevance

  • To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system.
  • Given information needs and documents, you need to collect relevance assessments. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems, and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
  • Humans and their relevance judgments are quite idiosyncratic and variable. In the social sciences, a common measure for agreement between judges is the kappa statistic.
  • The choice of judge can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.
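A minimal sketch of a two-judge kappa statistic, estimating chance agreement from the pooled label frequencies of both judges. The judgment lists are invented for illustration.

```python
from collections import Counter

def kappa(judge_a, judge_b):
    """Kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed
    proportion of agreement and P(E) is the agreement expected by chance,
    estimated here from the pooled label frequencies of both judges."""
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    pooled = Counter(judge_a) + Counter(judge_b)
    p_chance = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_agree - p_chance) / (1 - p_chance)

# Two judges labeling four documents as relevant (1) or non-relevant (0).
judge_a = [1, 1, 1, 0]
judge_b = [1, 1, 0, 0]
print(kappa(judge_a, judge_b))  # ≈ 0.467
```

Kappa is 1 for perfect agreement and 0 for agreement no better than chance; a common rule of thumb treats values above 0.8 as good agreement.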
Critiques and justifications of the concept of relevance

  • One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinct usefulness after the user has looked at certain other documents.
Results snippets

  • The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.

  • Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design.
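A toy illustration of a dynamic (query-biased) snippet: slide a fixed-size window over the document and keep the window containing the most query terms. Real snippet generators are far more sophisticated; the whitespace tokenization and window size here are simplifying assumptions.

```python
# Toy query-biased snippet: return the fixed-size window of the document
# that contains the most query terms (case-insensitive, whitespace tokens).

def dynamic_snippet(text, query, window=8):
    tokens = text.split()
    query_terms = {t.lower() for t in query.split()}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(tokens) - window + 1)):
        hits = sum(1 for t in tokens[start:start + window]
                   if t.lower() in query_terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return " ".join(tokens[best_start:best_start + window])

doc = ("static summaries never change but dynamic summaries highlight "
       "the query terms that matched the document")
print(dynamic_snippet(doc, "query terms"))
# → "change but dynamic summaries highlight the query terms"
```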
