by Radlinski and Craswell (SIGIR 2010)
This paper compares IR metrics based on user behaviour with the following standard, judgment-based IR metrics:
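- Precision@k -- the fraction of relevant results among the top k results
- Mean Average Precision (MAP) -- the mean over all queries of the average precision, where average precision is the average of Precision@k taken at each rank k that holds a relevant result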
- Normalized Discounted Cumulative Gain (NDCG)
Judgment-based metrics have several limitations:
- in terms of sensitivity, sensitive measures are required to detect large numbers of small improvements (an absolute improvement of the IR algorithm of 5-6% was required before the direction of the difference could be detected on fifty TREC topics),
- changing from informational to navigational assumptions when judging can change the outcome of the metric, and
- in terms of fidelity, judges are usually (i) far removed from the search process and, therefore, create unrealistic queries based on observation, and (ii) have a hard time assessing documents according to the user's _actual_ information needs.
Judgment-based metrics: The authors let trained judges assess the relevance of the top ten results on a five-point scale. As precision and MAP require binary labels (relevant vs. non-relevant), these judgments were converted to a binary scale: grades 1-2 count as relevant, grades 3-5 as non-relevant.
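To make the judgment-based metrics concrete, here is a minimal Python sketch of the binarization and of Precision@k, average precision, MAP and NDCG. It is an illustration, not the authors' code. Several details are assumptions not taken from the paper: the function names, the gain mapping `5 - grade` for NDCG, the log2 discount, and the choice to normalise average precision and the ideal DCG using only the results present in the judged list.

```python
import math


def binarize(grades, max_relevant_grade=2):
    """Map five-point grades (1 = best) to binary labels: 1-2 relevant, 3-5 not."""
    return [1 if g <= max_relevant_grade else 0 for g in grades]


def precision_at_k(rels, k):
    """Fraction of relevant results among the top k."""
    return sum(rels[:k]) / k


def average_precision(rels):
    """Average of Precision@i over the ranks i that hold a relevant result.

    Assumption: normalised by the relevant results found in this list only.
    """
    hits = 0
    total = 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def mean_average_precision(rels_per_query):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumption)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains, k):
    """DCG@k divided by the DCG@k of the same gains in ideal (sorted) order."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical judgments for one query, in ranked order (1 = best grade).
grades = [1, 3, 2, 5, 4]
rels = binarize(grades)             # [1, 0, 1, 0, 0]
gains = [5 - g for g in grades]     # assumed graded gain for NDCG
print(precision_at_k(rels, 3), average_precision(rels), ndcg(gains, 5))
```

Note that NDCG uses the graded judgments directly, so only precision and MAP depend on where the binary threshold is placed.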
Evaluation with interleaving: Interleaving combines the results of two retrieval functions into a single result list (omitting duplicates) and uses the users' clicks to indicate a relative preference for one of these functions.
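The following sketch shows one way such an interleaved list can be built and scored. It uses team-draft interleaving, a common variant; this summary does not state which variant the paper uses, so treat the choice as an assumption. The function names, the `"A"`/`"B"` team labels and the tie handling are likewise illustrative.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings without duplicates, remembering which ranker
    ("team") contributed each result."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            team = "A"
        elif picks["B"] < picks["A"]:
            team = "B"
        else:
            team = rng.choice("AB")
        # The team contributes its highest-ranked result not yet shown.
        candidate = next((d for d in rankings[team] if d not in team_of), None)
        if candidate is None:
            # This ranker has nothing new left, so the other one picks.
            team = "B" if team == "A" else "A"
            candidate = next(d for d in rankings[team] if d not in team_of)
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def click_preference(team_of, clicked):
    """Credit each click to the team that contributed the clicked result."""
    a = sum(1 for d in clicked if team_of.get(d) == "A")
    b = sum(1 for d in clicked if team_of.get(d) == "B")
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


# Hypothetical rankings from two retrieval functions for the same query.
merged, team_of = team_draft_interleave(
    ["d1", "d2", "d3"], ["d2", "d4", "d5"], rng=random.Random(0)
)
print(merged, click_preference(team_of, clicked=[merged[0]]))
```

Each impression yields one preference (A, B or tie); aggregating these preferences over many impressions gives the relative comparison between the two retrieval functions.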
The evaluations show that for a small number of results the "better" ranker might perform worse according to some metrics (e.g. MAP). In conclusion, reliably detecting small differences between IR algorithms requires about 5,000 judged queries with standard IR metrics, which is about as reliable as interleaving with 50,000 user impressions.