by Dimitry Zibold
The following article summarizes some interesting aspects of Dimitry's research:
- A Shingle is a contiguous sub-sequence of tokens in a document (e.g. rei, bei, ...).
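- N-grams are contiguous sub-sequences of n terms (e.g. the 3-gram "nuclear power plant", ...)
- Levenshtein distance is an edit distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.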
- Broder, A. Z. et al. (1997): Syntactic Clustering
- Chowdhury et al. (2002): I-Match (detects similar documents)
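
The terms defined above can be made concrete with a small Python sketch. It is only an illustration, not code from Dimitry's research: the function names `shingles`, `resemblance` and `levenshtein` are hypothetical, and using the Jaccard overlap of shingle sets as the similarity measure is an assumption modelled on the resemblance idea from the syntactic clustering work.

```python
def shingles(tokens, n):
    """Return the set of contiguous n-item sub-sequences of `tokens`.

    `tokens` may be a list of words (word n-grams) or a string
    (character shingles such as "rei").
    """
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(a, b, n):
    """Jaccard overlap of the n-shingle sets of two token sequences.

    Assumption for this sketch: two inputs without any shingles
    are treated as identical (resemblance 1.0).
    """
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `shingles("nuclear power plant".split(), 3)` yields the single 3-gram `("nuclear", "power", "plant")`, `resemblance("a b c d".split(), "b c d e".split(), 2)` is 0.5, and `levenshtein("kitten", "sitting")` is 3.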