Finding Text Reuse in the Web

Albert Weichselbraun

is a Professor of Information Science at the University of Applied Sciences of the Grisons.

by Michael Bendersky and W. Bruce Croft (WSDM'09)

This article discusses an approach for finding three different kinds of text reuse on the web:

verbatim copies (nearly duplicate sentences)
restatements
references to the same event that bear only topical similarity to the original sentence

The most interesting points discussed in this article are:

the algorithm presented for retrieving related sentences
the measures discussed for determining text reuse (word overlap, query likelihood, mixture model, dependence model)
different approaches to determine a source's date (earliest date, earliest date in the longest dense sequence, closest date in context, ...)
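
To make the first two measures concrete, here is a minimal Python sketch of word overlap and a query-likelihood score. It uses generic textbook formulations and is not necessarily the exact definitions used in the paper. The function names, the whitespace tokenizer, the add-one collection estimate, and the smoothing parameter `mu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopword removal.
    return text.lower().split()


def word_overlap(query, candidate):
    """Fraction of distinct query terms that also occur in the candidate."""
    q_terms = set(tokenize(query))
    c_terms = set(tokenize(candidate))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def query_likelihood(query, candidate, collection, mu=10.0):
    """Log-likelihood of the query under a Dirichlet-smoothed unigram
    language model of the candidate sentence (generic formulation)."""
    cand = Counter(tokenize(candidate))
    coll = Counter(tok for sent in collection for tok in tokenize(sent))
    cand_len = sum(cand.values())
    coll_len = sum(coll.values())
    score = 0.0
    for term in tokenize(query):
        # Add-one estimate so terms unseen in the collection do not yield log(0).
        p_coll = (coll[term] + 1) / (coll_len + len(coll) + 1)
        score += math.log((cand[term] + mu * p_coll) / (cand_len + mu))
    return score


def rank_candidates(query, collection, mu=10.0):
    """Return the collection sentences sorted by descending query likelihood."""
    return sorted(
        collection,
        key=lambda sent: query_likelihood(query, sent, collection, mu),
        reverse=True,
    )
```

Word overlap only rewards shared vocabulary, so it is best suited to near-verbatim copies. The smoothed language-model score still assigns a non-zero probability to sentences that miss some query terms, which makes it more tolerant of restatements.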

The points above could be relevant for DIVINE as well as for our new FP7 projects.
