Finding Text Reuse on the Web

by Michael Bendersky and W. Bruce Croft (WSDM'09)

This article discusses an approach for finding three different kinds of text reuse on the web:

  • verbatim copies (nearly duplicate sentences)
  • restatements
  • references to the same event that bear only topical similarity to the original sentence

The most interesting points discussed in this article are:

  1. the algorithm presented for retrieving related sentences
  2. the measures discussed for detecting text reuse (word overlap, query likelihood, mixture model, dependence model)
  3. the different approaches to determining a source's date (earliest date, earliest date in the longest dense sequence, closest date in context, ...)

The points above could be relevant for DIVINE as well as for our new FP7 projects.
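
To make two of the measures from point 2 more concrete, here is a minimal Python sketch of word overlap and a Dirichlet-smoothed query likelihood score. It is an illustration under assumptions, not the formulation from the paper: the function names (`tokenize`, `word_overlap`, `query_log_likelihood`), the simple whitespace tokenizer, the add-one floor on collection probabilities, and the smoothing value `mu=100.0` are all chosen here for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def word_overlap(query, candidate):
    """Fraction of distinct query words that also occur in the candidate."""
    q = set(tokenize(query))
    c = set(tokenize(candidate))
    return len(q & c) / len(q) if q else 0.0


def query_log_likelihood(query, candidate, collection_counts, mu=100.0):
    """Log-probability of the query under a Dirichlet-smoothed unigram
    language model of the candidate sentence (mu is an assumed value)."""
    cand_counts = Counter(tokenize(candidate))
    cand_len = sum(cand_counts.values())
    coll_len = sum(collection_counts.values())
    vocab_size = len(collection_counts)
    score = 0.0
    for word in tokenize(query):
        # Add-one floor (an assumption) so unseen words never give log(0).
        p_coll = (collection_counts.get(word, 0) + 1) / (coll_len + vocab_size + 1)
        score += math.log((cand_counts[word] + mu * p_coll) / (cand_len + mu))
    return score


if __name__ == "__main__":
    sentences = [
        "The senator resigned on Monday after the vote.",
        "After the vote, the senator resigned on Monday.",
        "Stock markets fell sharply in early trading.",
    ]
    collection = Counter(tok for s in sentences for tok in tokenize(s))
    query = sentences[0]
    for s in sentences[1:]:
        print(word_overlap(query, s), query_log_likelihood(query, s, collection), s)
```

In this toy setup the reordered sentence scores higher than the unrelated one on both measures. The mixture and dependence models mentioned in the paper are not covered by this sketch.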