# Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

Andoni, A. & Indyk, P., 2008. Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51(1), pp. 117–122.

## Summary

This article gives an overview of locality-sensitive hashing (LSH), a family of methods for solving the approximate nearest neighbor problem in high dimensions.

## Concepts

1. Distance: the distance between two d-dimensional points p and q is frequently computed with the $\ell_s$ norm metric: $||p-q||_s = \bigl(\sum_{i=1}^d |p_i-q_i|^s\bigr)^{1/s}$
2. R-near neighbor: a point p is an R-near neighbor of q if the distance between these points is at most R.
3. Locality-sensitive hashing: $\mathcal{H}$ is a family of hash functions h which map data points to some universe U, such that the collision probability $\Pr_{\mathcal{H}}[h(q) = h(p)]$ is high if p is an R-near neighbor of q, and considerably lower if it is not.
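To make the definition concrete, here is a minimal sketch of the classic bit-sampling family for binary vectors under Hamming distance (the function name `sample_bit_hash` is illustrative, not from the article):

```python
import random

def sample_bit_hash(d, seed=None):
    # One LSH function for d-bit vectors under Hamming distance:
    # pick a random coordinate i and let h(p) = p[i].
    rng = random.Random(seed)
    i = rng.randrange(d)
    return lambda p: p[i]

# For this family, Pr[h(p) == h(q)] = 1 - hamming(p, q) / d, so near
# points (small Hamming distance) collide with high probability.
p = [0, 1, 1, 0, 1, 0, 0, 1]
q = [0, 1, 1, 0, 1, 0, 1, 1]   # differs from p in one coordinate
collisions = sum(
    sample_bit_hash(len(p), seed=s)(p) == sample_bit_hash(len(p), seed=s)(q)
    for s in range(1000)
)
# Expected collision rate is roughly 1 - 1/8 = 0.875.
```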

## Design

Typically the probability gap between R-near neighbors and other points is rather small, so the gap is amplified by concatenating k hash functions drawn from $\mathcal{H}$ into

$g_j(q) = \bigl(h_{1,j}(q), \ldots, h_{k,j}(q)\bigr)$

and by maintaining L hash tables, one per function $g_j$ with $j = 1, \ldots, L$. If each $h_{t,j}$ outputs a single bit, each $g_j$ is a k-bit hash. The hash functions $h_{t,j}$ ($1 \leq t \leq k$, $1 \leq j \leq L$) are chosen independently and uniformly at random from $\mathcal{H}$.

1. Number of hashes k: if a single hash collides with probability $P_1$ for R-near neighbors, the concatenation collides with probability $P_1^k$, i.e. k determines the probability that $g_j(p) = g_j(q)$. Large values of k increase the method's specificity (the collision probability of points that are not R-near neighbors drops even faster, reducing the number of false positives).
2. Number of hash tables L: larger values of L increase the method's sensitivity (i.e. the number of true positives). Consequently, L must be sufficiently large (especially for large values of k) to ensure that each R-near neighbor collides with the query point in at least one table.

## Hash functions

The following families of hash functions $\mathcal{H}$, each tied to a distance measure, are frequently used for LSH:

1. Hamming distance, which corresponds to the number of coordinates in which p and q differ.
2. $\ell_s$ distance
3. Jaccard measure: $s(A,B) = \frac{|A \cap B|}{|A \cup B|}$. Since $s(A,B)$ is a similarity measure, the corresponding distance is $d(A,B) = 1 - s(A,B)$.
4. Arccos: $\Theta(p,q) = \arccos\bigl(\frac{p \cdot q}{||p|| \cdot ||q||}\bigr)$, which measures the angle between two vectors.
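For reference, the four distance measures listed above can be computed directly (helper names are illustrative):

```python
import math

def hamming(p, q):
    # Number of coordinates in which p and q differ.
    return sum(a != b for a, b in zip(p, q))

def l_s(p, q, s):
    # l_s norm distance: (sum_i |p_i - q_i|^s)^(1/s).
    return sum(abs(a - b) ** s for a, b in zip(p, q)) ** (1 / s)

def jaccard_distance(A, B):
    # 1 - |A ∩ B| / |A ∪ B| for sets A and B.
    return 1 - len(A & B) / len(A | B)

def arccos_distance(p, q):
    # Angle between vectors p and q.
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(dot / norms)
```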
