Kan, Min-Yen and Tan, Yee Fan: Record Matching in Digital Library Metadata, Communications of the ACM, Volume 51 (2), 91-94
The article provides an excellent overview of techniques used to identify duplicate records in digital libraries. The authors present
- uniform string matching techniques based on set-matching (the Jaccard measure, cosine similarity, degree of similarity, ...), sequence-based measures (edit distance), and hybrid approaches (e.g. set-matching for single words and sequence-based measures for the terms in sentences), and
- graphical formalisms such as social networks, network cuts, and random walks to distinguish different authors with the same name.
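
To make the difference between the set-based and the sequence-based measures from the first item concrete, here is a minimal Python sketch. It is not taken from the article: the function names, the whitespace tokenization, and the lowercasing are assumptions chosen for illustration only.

```python
def jaccard(a: str, b: str) -> float:
    """Set-based measure: shared word types divided by all word types."""
    # Assumption: tokens are lowercased, whitespace-separated words.
    x, y = set(a.lower().split()), set(b.lower().split())
    union = x | y
    # Two empty strings are treated as identical.
    return len(x & y) / len(union) if union else 1.0


def edit_distance(a: str, b: str) -> int:
    """Sequence-based measure: Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    t1 = "Record Matching in Digital Library Metadata"
    t2 = "Record Matching in Digital Library Metdata"
    print(jaccard(t1, t2))        # word sets differ in the misspelled word
    print(edit_distance(t1, t2))  # character sequences differ by one edit
```

A single misspelled word lowers the Jaccard score noticeably because the whole token no longer matches, while the edit distance stays small. A hybrid approach of the kind mentioned above would try to combine both behaviours.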