Evaluation in Information Retrieval


Manning, C.D., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval 1st ed., Cambridge University Press.

Chapter 8 - Evaluation in information retrieval

Retrieval effectiveness is measured using a test collection consisting of

  1. a document collection
  2. a test suite of information needs, usually expressed as queries (50 information needs has been found to be a sufficient minimum)
  3. a set of relevance judgments (binary)
Tunable parameters (if present) must be optimized on a separate development collection, not on the collection used for the final evaluation.

Standard Test Collections

  1. Cranfield collection (historical)
  2. Text REtrieval Conference (TREC) - relevance judgments for 450 information needs, called topics; TRECs 6-8 provide 150 information needs over about 528,000 newswire and broadcast articles and are considered one of the largest and most consistent subcollections
  3. GOV2 - 25 million pages
  4. NII Test Collections for IR Systems (NTCIR) - East Asian languages and cross-language retrieval
  5. Cross Language Evaluation Forum (CLEF) - European languages for cross-language retrieval
  6. Reuters-21578 and Reuters-RCV1 - text classification
  7. 20 Newsgroups - text classification

Presentation of search results

  • static snippets
  • dynamic snippets -> explain why that particular document has been retrieved
  • text summarization -> up-to-date systems combine positional factors (favoring the first and last paragraphs and the first and last sentences of paragraphs) with content factors (emphasizing sentences containing key terms)


Common metrics and visualizations for unranked search results

  • Precision, Recall
  • Accuracy (tp+tn)/(tp+fp+fn+tn) - often not appropriate, since retrieval data is usually extremely skewed (if 99.9% of the documents are non-relevant, a system optimizing accuracy performs well by always returning the negative answer)
  • F-measure - the harmonic mean of P and R (and therefore always less than or equal to the arithmetic mean)
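These set-based measures follow directly from the confusion-matrix counts; a minimal sketch in Python (function names are my own):

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    """Fraction of all judgments that are correct; misleading on skewed data."""
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Skewed example: 999 non-relevant documents, 1 relevant document.
# A system that retrieves nothing scores 99.9% accuracy but zero recall.
```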
Common metrics and visualizations for ranked search results

  • Precision/recall graphs (saw-tooth shape; precision drops whenever a non-relevant document is retrieved)
  • Interpolated precision (the highest precision found for any recall level r' >= r)
  • 11-point average precision (precision at a recall of 0.0, 0.1, ... 1.0)
  • mean average precision (MAP) - approximates the area under the interpolated precision-recall curve for a single information need
  • precision at k (e.g. precision at 10)
  • R-precision - precision at |Rel| documents, where |Rel| is the total number of relevant documents in the collection
  • Receiver Operating Characteristic (ROC) curve - x = 1 - specificity, y = recall, with specificity = tn/(fp+tn)
  • cumulative gain and normalized discounted cumulative gain - designed for non-binary notions of relevance
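Average precision and (n)DCG for a single ranked list can be sketched as follows (binary relevance flags for AP, graded gains for DCG; all names are my own). MAP is then simply the mean of AP over all information needs:

```python
import math

def average_precision(ranked_rel, total_relevant):
    """AP: mean of precision@k at each rank k where a relevant doc appears.
    ranked_rel is a list of 0/1 relevance flags in rank order."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

Note that nDCG of a ranking that already lists documents in order of decreasing gain is 1.0 by construction.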

Assessing Relevance

  • pooling - only the top k documents from the runs of the participating systems are judged for relevance
  • kappa measure (inter-rater agreement)
    • idea: kappa = [P(same rating, observed) - P(same rating, random)] / [1-P(same rating, random) ]
    • interpretation:
      • above 0.8 -> good agreement
      • between 0.67 and 0.8 -> fair agreement
      • below 0.67 -> dubious
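The kappa computation can be sketched for two judges with binary assessments, estimating chance agreement from the judges' pooled marginal proportions (a hypothetical setup; names are my own):

```python
def kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length lists of 0/1 relevance judgments."""
    n = len(judge_a)
    # P(A): observed proportion of agreement
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): agreement expected by chance, from pooled marginals
    p_yes = (sum(judge_a) + sum(judge_b)) / (2 * n)
    p_no = 1 - p_yes
    p_chance = p_yes ** 2 + p_no ** 2
    return (p_agree - p_chance) / (1 - p_chance)
```

Perfect agreement on a mixed set of labels yields kappa = 1; agreement no better than chance yields kappa = 0.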


  • document relevance is treated as independent of the relevance of other documents
  • binary assessments
  • relevance is treated as an absolute, objective decision -> Kekäläinen (2005) shows that 4-grade relevance judgments and the notion of cumulative gain substantially affect system rankings
  • results are based on one collection (domain-specific)


  • INEX, some TREC tracks and NTCIR -> three to four relevance classes
  • marginal relevance (the usefulness of a document is assessed relative to the documents the user has already seen) - maximizing marginal relevance => a diverse set of documents is returned
  • user utility - make users happy
    • relevance is a proxy for user happiness
    • user studies measure user happiness
    • user studies of IR system effectiveness by Saracevic and Kantor (1988, 1996)
  • refine deployed systems: A/B testing
    • deploy a modified version of the system (B)
    • user feedback to evaluate the differences between these systems (non-intrusive by using clickthrough log analysis, clickstream mining, ...)
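Comparing the two variants by clickthrough rate from such a log can be sketched as follows (the log format and all names are hypothetical):

```python
def clickthrough_rate(log, variant):
    """Fraction of impressions for a variant that received a click.
    log is a list of (variant, clicked) pairs from a clickthrough log."""
    shown = [clicked for v, clicked in log if v == variant]
    return sum(shown) / len(shown) if shown else 0.0

# Hypothetical log: "B" is the modified system under test.
log = [("A", 1), ("A", 0), ("B", 1), ("B", 1), ("A", 0), ("B", 0)]
```

In practice one would also test whether the observed difference in rates is statistically significant before adopting variant B.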