Evaluation in Information Retrieval
Manning, C.D., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval, 1st ed., Cambridge University Press.
Chapter 8 - Evaluation in information retrieval
Retrieval effectiveness is measured using a test collection consisting of (a minimal representation is sketched after this list):
- a document collection
- a test suite of information needs, usually expressed as queries (as a rule of thumb, a minimum of 50 has been found to be sufficient)
- a set of (binary) relevance judgments
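A minimal sketch of how such a test collection might be represented; the judgment lines loosely follow a TREC qrels-style layout (topic, iteration, document ID, relevance), but the helper name and the example document IDs are illustrative assumptions, not taken from the chapter.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse lines of the form: <topic_id> <iteration> <doc_id> <relevance>."""
    judgments = defaultdict(dict)          # topic_id -> {doc_id: 0/1}
    for line in lines:
        topic_id, _iteration, doc_id, rel = line.split()
        judgments[topic_id][doc_id] = int(rel)
    return judgments

# Made-up example judgments for two topics (illustrative only).
qrels_lines = [
    "401 0 DOC-10082 1",
    "401 0 DOC-10169 0",
    "402 0 DOC-05418 1",
]
judgments = parse_qrels(qrels_lines)
print(judgments["401"])                    # {'DOC-10082': 1, 'DOC-10169': 0}
```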
Standard Test Collections
- Cranfield collection (historical)
- Text REtrieval Conference (TREC) - relevance judgments for 450 information needs, called topics; TRECs 6-8 provide 150 information needs over about 528,000 newswire and broadcast articles and are considered one of the largest and most consistent subcollections
- GOV2 - 25 million web pages
- NII Test Collections for IR Systems (NTCIR) - East Asian languages and cross-language retrieval
- Cross Language Evaluation Forum (CLEF) - European languages for cross-language retrieval
- Reuters (Reuters-21578, Reuters-RCV1) - text classification
- 20 Newsgroups - text classification
Presentation of search results
- static snippets
- dynamic snippets -> explain why that particular document has been retrieved
- text summarization -> up-to-date systems combine positional factors (favoring the first and last paragraphs of a document and the first and last sentences of a paragraph) with content factors (emphasizing sentences containing key terms); see the sketch below
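A hedged sketch of how positional and content factors could be combined to pick a summary sentence; the weights, the scoring function, and the example key terms are assumptions for illustration, not the book's algorithm.

```python
def score_sentence(position, num_sentences, terms, key_terms,
                   w_position=1.0, w_content=1.0):
    # Positional factor: first and last sentences of a paragraph score higher.
    positional = 1.0 if position in (0, num_sentences - 1) else 0.5
    # Content factor: fraction of terms that are key terms.
    content = sum(1 for t in terms if t in key_terms) / max(len(terms), 1)
    return w_position * positional + w_content * content

paragraph = [
    "information retrieval systems are evaluated with test collections".split(),
    "the weather was pleasant that day".split(),
    "precision and recall are standard evaluation measures".split(),
]
key_terms = {"evaluation", "retrieval", "precision", "recall"}
scores = [score_sentence(i, len(paragraph), s, key_terms)
          for i, s in enumerate(paragraph)]
best = max(range(len(paragraph)), key=lambda i: scores[i])
print(" ".join(paragraph[best]))           # the highest-scoring sentence
```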
Evaluation
Common metrics for unranked result sets (sketched in code below):
- Precision, Recall
- Accuracy = (tp+tn)/(tp+fp+fn+tn) - often not appropriate, since retrieval data is usually extremely skewed (when 99.9% of the documents are non-relevant, a system maximizing accuracy does well by simply returning nothing)
- F-measure - the (weighted) harmonic mean of P and R (and therefore always less than or equal to the arithmetic mean)
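A short sketch of the unranked set measures above, computed directly from the confusion counts (tp, fp, fn, tn); the function names are my own, and the skewed example counts merely illustrate why accuracy is misleading.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean; beta=1 gives the balanced F1 measure.
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

tp, fp, fn, tn = 30, 10, 20, 9940          # heavily skewed, made-up counts
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r), accuracy(tp, fp, fn, tn))
# Accuracy is ~0.997 even though recall is only 0.6: accuracy rewards the
# dominant non-relevant class, which is why it is rarely used in IR.
```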
Metrics and visualizations for ranked retrieval results (sketched in code after this list):
- Precision/recall graphs (saw-tooth shape: each retrieved non-relevant document makes precision drop)
- Interpolated precision (at recall level r: the highest precision found for any recall level r' >= r)
- 11-point average precision (precision at a recall of 0.0, 0.1, ... 1.0)
- mean average precision (MAP) - average precision approximates the area under the interpolated precision-recall curve for a single information need; MAP is its mean over all information needs
- precision at k (e.g. precision at 10)
- R-precision - precision after |Rel| documents have been retrieved (where |Rel| is the number of relevant documents in the collection)
- Receiver Operating Characteristic (ROC) curve - x = 1 - specificity, y = recall, with specificity = tn/(fp+tn)
- cumulative gain and normalized discounted cumulative gain - designed for non-binary notions of relevance
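A sketch of the main ranked measures from the list above, assuming each run is given as a list of relevance grades in rank order (0/1 for the binary measures, non-negative gains for (n)DCG); the function names and the log2 discount variant are assumptions.

```python
import math

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    """Mean of precision@rank over the ranks where a relevant document appears;
    relevant documents that are never retrieved contribute 0."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (rels, total_relevant) pairs, one per information need."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

def r_precision(rels, total_relevant):
    return precision_at_k(rels, total_relevant)

def dcg(gains, k):
    # One common discount variant: gain at rank i divided by log2(i + 1).
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg(gains, k):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0

ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]      # binary relevance in rank order
print(precision_at_k(ranking, 5))             # 0.6
print(average_precision(ranking, total_relevant=6))
print(r_precision(ranking, total_relevant=6)) # precision at rank |Rel| = 6
print(ndcg([3, 2, 3, 0, 1, 2], k=6))          # graded relevance for (n)DCG
```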
Assessing Relevance
- pooling - relevance is assessed only over the top-k documents returned by the various participating systems
- kappa measure (inter-rater agreement); see the sketch after this list
- idea: kappa = [P(same rating, observed) - P(same rating, random)] / [1 - P(same rating, random)]
- interpretation:
- above 0.8 -> good agreement
- between 0.67 and 0.8 -> fair agreement
- below 0.67 -> dubious
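A sketch of the kappa computation for two judges with binary judgments, using pooled marginals for the chance-agreement term (one common convention); the example judgments are made up.

```python
def kappa(judge1, judge2):
    n = len(judge1)
    # P(same rating, observed): proportion of documents the judges agree on.
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # P(same rating, random): chance agreement from pooled marginals.
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

judge1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
print(kappa(judge1, judge2))               # 0.6 -> below the 0.67 threshold
```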
Problems
- document relevance is treated as independent of the relevance of other documents
- relevance assessments are binary
- relevance is treated as an absolute, objective decision -> Kekäläinen (2005) shows that a 4-way relevance judgment combined with the notion of cumulative gain substantially affects system rankings
- results are based on one collection (domain-specific)
Countermeasures
- INEX, some TREC tracks and NTCIR -> three to four relevance classes
- marginal relevance (whether a document is still useful after the user has looked at certain other documents) - maximizing marginal relevance => diverse sets of documents are returned
- user utility - make users happy
- relevance is a proxy for user happiness
- user studies measure user happiness
- user studies of IR system effectiveness by Saracevic and Kantor (1988, 1996)
- refine deployed systems: A/B testing
- deploy a modified version of the system (B) to a small fraction of users, while most users stay on the current system (A)
- use user feedback to evaluate the differences between the two systems, non-intrusively via clickthrough log analysis, clickstream mining, ... (see the sketch below)
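A hedged sketch of non-intrusive A/B evaluation from a clickthrough log: compare the click-through rate of the current system (A) against the variant (B); the log format and field names are illustrative assumptions.

```python
from collections import Counter

# (variant, clicked_rank) per query impression; clicked_rank is None when
# the user clicked nothing. Both the format and the numbers are made up.
log = [
    ("A", 1), ("A", None), ("A", None), ("A", 1),
    ("B", 1), ("B", 1), ("B", 2), ("B", None),
]

impressions = Counter(variant for variant, _ in log)
clicks = Counter(variant for variant, rank in log if rank is not None)

for variant in sorted(impressions):
    ctr = clicks[variant] / impressions[variant]
    print(f"variant {variant}: {clicks[variant]}/{impressions[variant]} clicks, CTR={ctr:.2f}")
```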