Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

1 minute read

Andoni, A. & Indyk, P., 2008. Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51(1), pp.117—122.

Summary

This article gives an overview of locality sensitive hashing algorithms, i.e. methods for approximating the nearest neighbor problem.

Concepts

  1. distance:The distance between d-dimensional data points P is frequently computed with the following $$\mathcal{l}_s$$ norm metric: \[||p-q||_s = \bigl(\sum_{i=1}^d |p_i-q_i|^s\bigr)^{1/s}\]
  2. R-nearest neighbor: a point p is an R-near neighbor of q if the distance between these points is at most R.
  3. locality sensitive hashing: $$\mathcal{H}$$ is a family of hash functions h which map data points to some universe U, so that h has a high probability $$Pr_{\mathcal{H}}[h(q) = h(p)]$$ if p and q are R-nearest neighbors, and a considerably lower probability if they are not.

Design

Typically the probability gap between R-nearest neighbors and other points is rather small, requiring us to amplify this gap by combining several hash functions $$\mathcal{H}$$ yielding

\[g_j(q) = h_{1,j}(q), ..., h_{k,j}(q)\]

and by computing L buckets $$g_j(q)$$ with j=1, ...L. If the buckets store binary values L corresponds to the hash length in bits. The hash functions $$h_{t,j} (1\leq t \leq k, 1 \leq j \leq L)$$ are randomly chosen.

  1. Number of hashes k: the number of hashes k determines the probability $$P_1^k$$ that $$g_i(p) = g_i(q)$$. Large values of k increase the method's specificity (i.e. reduces $$P_1$$ and, therefore, the number of false positives).
  2. Number of hash tables L: larger values of L increase the method's sensitivity (i.e. the number of true positives). Consequently, L must be sufficiently large (especially for high values of k) to ensure that all R-near neighbors collide with the query point at least ones.

Hash functions

The following families of hash functions (i.e. distance metrics) $$\mathcal{H}$$ are frequently used for LSH:

  1. The hamming distance which corresponds to the number of points in which p and q differ.
  2. $$\mathcal{l}_s$$ distance
  3. Jaccard measure: $$s(A,B) = \frac{|A \cap B|}{|A \cup B|}$$. Since s(A,B) is a similarity measure, the distance needs to be computed by d(A, B) =1-s(A, B)
  4. Arccos: $$\Theta(p,q) = arccos\bigl(\frac{p\cdot q}{||p||\cdot||q||}\bigr)$$ - this distance measure the angle between two vectors.