# Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

Andoni, A. & Indyk, P., 2008. Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions.

*Communications of the ACM*, 51(1), pp.117—122.## Summary

This article gives an overview of locality sensitive hashing algorithms, i.e. methods for approximating the nearest neighbor problem.## Concepts

**distance**:The distance between*d-*dimensional data points*P*is frequently computed with the following $$\mathcal{l}_s$$*norm*metric: \[||p-q||_s = \bigl(\sum_{i=1}^d |p_i-q_i|^s\bigr)^{1/s}\]**R-nearest neighbor**: a point*p*is an*R-near*neighbor*q*if the distance between these points is at most R.**locality sensitive hashing**: $$\mathcal{H}$$ is a family of hash functions*h*which map data points to some universe*U*, so that*h*has a high probability $$Pr_{\mathcal{H}}[h(q) = h(p)]$$ if*p*and*q*are R-nearest neighbors, and a considerably lower probability if they are not.

## Design

Typically the probability gap between R-nearest neighbors and other points is rather small, requiring us to amplify this gap by combining several hash functions $$\mathcal{H}$$ yielding\[g_j(q) = h_{1,j}(q), ..., h_{k,j}(q)\]

and by computing *L* buckets $$g_j(q)$$ with *j=1, ...L*. If the buckets store binary values *L* corresponds to the hash length in bits. The hash functions $$h_{t,j} (1\leq t \leq k, 1 \leq j \leq L)$$ are randomly chosen.

- Number of hashes
*k*: the number of hashes*k*determines the probability $$P_1^k$$ that $$g_i(p) = g_i(q)$$. Large values of*k*increase the method's specificity (i.e. reduces $$P_1$$ and, therefore, the number of false positives). - Number of hash tables
*L*: larger values of*L*increase the method's sensitivity (i.e. the number of true positives). Consequently,*L*must be sufficiently large (especially for high values of*k*) to ensure that all R-near neighbors collide with the query point at least ones.

## Hash functions

The following families of hash functions (i.e. distance metrics) $$\mathcal{H}$$ are frequently used for LSH:- The
*h**amming distance*which corresponds to the number of points in which*p*and*q*differ. - $$\mathcal{l}_s$$ distance
*Jaccard**measure:*$$s(A,B) = \frac{|A \cap B|}{|A \cup B|}$$. Since*s(A,B)*is a similarity measure, the distance needs to be computed by*d(A, B) =1-s(A, B)**Arccos:*$$\Theta(p,q) = arccos\bigl(\frac{p\cdot q}{||p||\cdot||q||}\bigr)$$ - this distance measure the angle between two vectors.