Finding Text Reuse on the Web
by Michael Bendersky and W. Bruce Croft (WSDM'09)
This article presents an approach for finding three kinds of text reuse on the web:
- verbatim copies (nearly duplicate sentences)
- restatements
- references to the same event that bear only topical similarity to the original sentence
The article also covers:
- an algorithm for retrieving related sentences
- measures for detecting text reuse (word overlap, query likelihood, mixture model, dependence model)
- approaches for determining a source's date (earliest date, earliest date in the longest dense sequence, closest date in context, ...)
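
To make the first two measures concrete, here is a minimal sketch of word overlap and query likelihood for sentence pairs. It is not the paper's exact formulation: the whitespace tokenizer, the Dirichlet smoothing with parameter `mu`, the add-one background estimate, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopword removal.
    return text.lower().split()


def word_overlap(query, candidate):
    """Fraction of distinct query words that also occur in the candidate."""
    q_words = set(tokenize(query))
    c_words = set(tokenize(candidate))
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def query_likelihood(query, candidate, collection, mu=10.0):
    """Log-probability of the query under a smoothed unigram model of the candidate.

    Assumption: Dirichlet smoothing against a background text (`collection`);
    `mu` is a hypothetical default, not a value taken from the paper.
    """
    cand_counts = Counter(tokenize(candidate))
    coll_counts = Counter(tokenize(collection))
    cand_len = sum(cand_counts.values())
    coll_len = sum(coll_counts.values())
    score = 0.0
    for word in tokenize(query):
        # Add-one estimate so words unseen in the background keep non-zero mass.
        p_background = (coll_counts[word] + 1) / (coll_len + len(coll_counts) + 1)
        score += math.log((cand_counts[word] + mu * p_background) / (cand_len + mu))
    return score
```

Under these assumptions, a near-verbatim copy should score high on both measures, while a restatement with different wording tends to lose word overlap faster than it loses query likelihood.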