An Open Digest-based Technique for Spam Detection

1 minute read

Damiani, E. et al., 2004. An Open Digest-based Technique for Spam Detection. In in Proceedings of the 2004 International Workshop on Security in Parallel and Distributed Systems. pp. 15—17.

Summary

This paper discusses the Nilsimsa open digest hash algorithm which is frequently used for Spam detection. The authors describe the computation of the 32-byte code, discuss different attack scenarios and measures to counter them.

Digest computation

slide a five character window through the input text and compute all eight possible tri-grams for each window (e.g. "igram" yields "igr", "gra", "ram", "ira", "iam", "grm", ...)
hash these trigrams using a hash function h() which maps every tri-gram to one of 256 accumulators and increment the corresponding accumulator. Nilsimsa uses the Trans53 hash function for hashing.
at the end of the process described below, compute the expected value of the accumulators and set the bits which correspond to each accumulator either to (1) if it exceeds this threshold or (o) otherwise.

Similarity computation

The Nilsimsa similarity is computed based on the bitwise difference between two Nilsimsa hashes. Documents are considered similar if they exceed a pre-defined similarity value.

>24 similar bits - conflict probability: 1.35E-4 (suggestions by Nilsimsa's original designer)
>54 similar bits - conflict probability: 7.39E-12 (suggested by the article's authors)

Attacks

Spammers can apply multiple techniques to prevent Nilsimsa from detecting duplicates:

Random addition: requires >300% of additional text to prevent detection.
Thesaurus substitutions: require >20% of replaced text
Perceptive substitution (security > s3cur1ty): requires >15% of the text to be altered
Aimed attacks (i.e. attacks which specifically target Nilsimsa): ~10% of the text needs to be altered

Aimed attacks manipulate Nilsimsa's accumulators by adding words which introduce new tri-grams that specifically alter the hash value. Although these attacks are relatively effective, they can be easily circumvented by computing the Nilsimsa hash twice with different hash functions.

Software

Java implementation of the Nilsimsa algorithm

Share on

Twitter Facebook LinkedIn

Albert Weichselbraun

An Open Digest-based Technique for Spam Detection

Summary

Digest computation

Similarity computation

Attacks

Software

Share on

You may also enjoy

Big, Linked Geospatial Data and Its Application in Earth Observation

Employment relations: a data driven analysis of job markets using online job boards and online professional networks

Suffix array

Dynamic feature scaling for online learning of binary classifiers