Information Diffusion

by Dimitry Zibold

The following article summarizes some interesting aspects of Dimitry's research:

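  • A shingle is a contiguous sub-sequence of tokens in a document (e.g. rei, bei, ...).
  • N-grams are sub-sequences of n terms (e.g. nuclear power plant, ...).
  • Levenshtein distance is the edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.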
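
The three terms above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not code from Dimitry's research: the function names `shingles` and `levenshtein`, the window size `k`, and the sample inputs (including the word "reise") are all hypothetical. The same sliding window yields character shingles when given a string and word n-grams when given a list of terms.

```python
def shingles(tokens, k):
    """Return every contiguous window of length k over `tokens` as a tuple.

    `tokens` may be a string (character shingles) or a list of words
    (word n-grams). Inputs shorter than k yield an empty list.
    """
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def levenshtein(a, b):
    """Edit distance between sequences a and b.

    Counts the minimum number of single-item insertions, deletions and
    substitutions, using the standard row-by-row dynamic programme.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Character shingles of a hypothetical sample word.
    print(["".join(s) for s in shingles("reise", 3)])  # ['rei', 'eis', 'ise']
    # Word 3-grams of a hypothetical sample phrase.
    print(shingles("nuclear power plant operator".split(), 3))
    # Classic edit-distance example.
    print(levenshtein("kitten", "sitting"))  # 3
```

Returning tuples keeps the shingles hashable, so they can be placed in a set when two documents are compared by their shared shingles.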

Technologies

  • Broder, A. Z. et al. (1997): Syntactic Clustering
  • Chowdhury et al. (2002): I-Match (detects similar documents