Comparing the Sensitivity of Information Retrieval Metrics

by Radlinski and Craswell (SIGIR 2010)

This paper compares IR metrics based on user behaviour with the following standard, judgment-based IR metrics:

Precision@k -- the precision at a certain cut-off point
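Mean Average Precision (MAP) -- the mean over all queries of the average precision, which in turn averages Precision@k over the ranks k at which a relevant result appears
Normalized Discounted Cumulative Gain (NDCG) -- graded relevance gains discounted by rank and normalized by the score of the ideal ordering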
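To make the three definitions concrete, here is a minimal Python sketch. It is not taken from the paper: the function names are my own, the NDCG variant uses linear gains and a log2 discount (other formulations exist), and the ideal ordering is computed only from the judged results that were passed in.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant.

    rels holds binary labels (1 = relevant, 0 = non-relevant) in rank order.
    """
    return sum(rels[:k]) / k


def average_precision(rels):
    """Average of Precision@i over the ranks i that hold a relevant result."""
    num_relevant = sum(rels)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i  # Precision@i at a relevant rank
    return total / num_relevant


def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(rels) for rels in runs) / len(runs)


def dcg(gains, k):
    """Discounted cumulative gain with linear gains (an assumption)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG divided by the DCG of the same gains in ideal (sorted) order."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

For the ranked labels `[1, 0, 1, 0]`, `precision_at_k(..., 2)` is 0.5, because only one of the top two results is relevant.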

Regarding the role of evaluation metrics, the authors note that

sensitive measures are required to detect large numbers of small improvements (an absolute improvement of 5-6% in the IR algorithm was required before the direction of the difference could be detected on fifty TREC topics),
changing from informational to navigational assumptions when judging relevance can change the outcome of a metric, and
with respect to fidelity, judges are usually (i) far removed from the search process and therefore create unrealistic queries based on observation, and (ii) have a hard time assessing documents according to the user's _actual_ information needs.

Evaluation:

Judgment-based metrics: The authors let trained judges assess the relevance of the top ten results on a five-point scale. As precision and MAP require binary results (relevant vs. non-relevant), the graded judgments were converted to a binary scale: grades 1-2 count as relevant, grades 3-5 as non-relevant.
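The conversion to binary labels is simple enough to write down. The sketch below follows the mapping given above (1-2 relevant, 3-5 non-relevant); the function name `to_binary` and the `relevant_grades` parameter are hypothetical and only serve the illustration.

```python
def to_binary(grade, relevant_grades=(1, 2)):
    """Map a five-point judgment to 1 (relevant) or 0 (non-relevant).

    The cut-off between grades 2 and 3 follows the summary above;
    relevant_grades is an illustrative parameter, not part of the paper.
    """
    if not 1 <= grade <= 5:
        raise ValueError("grade must be between 1 and 5")
    return 1 if grade in relevant_grades else 0


# Graded judgments for one result list, in rank order.
graded = [1, 3, 2, 5, 4]
binary = [to_binary(g) for g in graded]
```

The resulting list of 0/1 labels is what Precision@k and MAP operate on, while NDCG can use the graded judgments directly.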

Evaluation with interleaving: Interleaving combines the results of two retrieval functions into a single result list (omitting duplicates) and uses the users' clicks on that list to indicate a relative preference for one of the two functions.
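The summary above does not say which interleaving variant is used, so the following sketch assumes team-draft interleaving, one commonly used scheme: the two rankers take turns contributing their best not-yet-shown result, a coin flip breaks ties, and each click is credited to the ranker that contributed the clicked result. All names are my own.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings; returns the merged list and each result's team."""
    interleaved = []
    team_of = {}  # result -> "A" or "B", the ranker that contributed it
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}

    def next_unseen(ranking):
        # Highest-ranked result not yet shown (this omits duplicates).
        return next((doc for doc in ranking if doc not in team_of), None)

    while True:
        candidates = {team: next_unseen(r) for team, r in rankings.items()}
        if candidates["A"] is None or candidates["B"] is None:
            break  # stop once either ranking has nothing new to offer
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])  # coin flip when both are level
        doc = candidates[team]
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of


def preference(team_of, clicked_docs):
    """Return "A", "B" or "tie" depending on which team got more clicks."""
    clicks_a = sum(1 for d in clicked_docs if team_of.get(d) == "A")
    clicks_b = sum(1 for d in clicked_docs if team_of.get(d) == "B")
    if clicks_a > clicks_b:
        return "A"
    if clicks_b > clicks_a:
        return "B"
    return "tie"
```

A single impression yields only a noisy preference; the comparison of two rankers comes from aggregating these preferences over many impressions.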

The evaluations show that for a small number of results the "better" ranker might perform worse according to some metrics (e.g. MAP). In conclusion, reliably detecting small differences between IR algorithms requires about 5,000 judged queries with standard IR metrics, which is about as reliable as interleaving with 50,000 user impressions.
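One way to get a feeling for why small query sets can point in the wrong direction is to resample queries and count how often the "better" ranker actually wins. The sketch below is only an illustration under my own assumptions, not the procedure from the paper: `per_query_deltas` is a hypothetical list holding metric(A) minus metric(B) for each judged query, and queries are drawn with replacement.

```python
import random


def agreement_rate(per_query_deltas, sample_size, trials=1000, seed=0):
    """Fraction of resampled query sets in which ranker A beats ranker B.

    per_query_deltas: metric(A) - metric(B) per query (hypothetical input).
    sample_size: number of queries drawn, with replacement, per trial.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        sample = rng.choices(per_query_deltas, k=sample_size)
        if sum(sample) > 0:  # A is ahead on this sample of queries
            wins += 1
    return wins / trials
```

When A is better only on average, with many queries going the other way, the rate for a small `sample_size` tends to stay well below 1.0 and approaches it as more queries are sampled.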

Albert Weichselbraun
