Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

1 minute read

Andoni, A. & Indyk, P., 2008. Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51(1), pp.117—122.

Summary

This article gives an overview of locality sensitive hashing algorithms, i.e. methods for approximating the nearest neighbor problem.

Concepts

distance:The distance between d-dimensional data points P is frequently computed with the following $$\mathcal{l}_s$$ norm metric: \[||p-q||_s = \bigl(\sum_{i=1}^d |p_i-q_i|^s\bigr)^{1/s}\]

R-nearest neighbor: a point p is an R-near neighbor of q if the distance between these points is at most R.
locality sensitive hashing: $$\mathcal{H}$$ is a family of hash functions h which map data points to some universe U, so that h has a high probability $$Pr_{\mathcal{H}}[h(q) = h(p)]$$ if p and q are R-nearest neighbors, and a considerably lower probability if they are not.

Design

Typically the probability gap between R-nearest neighbors and other points is rather small, requiring us to amplify this gap by combining several hash functions $$\mathcal{H}$$ yielding

\[g_j(q) = h_{1,j}(q), ..., h_{k,j}(q)\]

and by computing L buckets $$g_j(q)$$ with j=1, ...L. If the buckets store binary values L corresponds to the hash length in bits. The hash functions $$h_{t,j} (1\leq t \leq k, 1 \leq j \leq L)$$ are randomly chosen.

Number of hashes k: the number of hashes k determines the probability $$P_1^k$$ that $$g_i(p) = g_i(q)$$. Large values of k increase the method's specificity (i.e. reduces $$P_1$$ and, therefore, the number of false positives).
Number of hash tables L: larger values of L increase the method's sensitivity (i.e. the number of true positives). Consequently, L must be sufficiently large (especially for high values of k) to ensure that all R-near neighbors collide with the query point at least ones.

Hash functions

The following families of hash functions (i.e. distance metrics) $$\mathcal{H}$$ are frequently used for LSH:

The hamming distance which corresponds to the number of points in which p and q differ.
$$\mathcal{l}_s$$ distance
Jaccard measure: $$s(A,B) = \frac{|A \cap B|}{|A \cup B|}$$. Since s(A,B) is a similarity measure, the distance needs to be computed by d(A, B) =1-s(A, B)
Arccos: $$\Theta(p,q) = arccos\bigl(\frac{p\cdot q}{||p||\cdot||q||}\bigr)$$ - this distance measure the angle between two vectors.

Share on

Twitter Facebook LinkedIn

Albert Weichselbraun

Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

Summary

Concepts

Design

Hash functions

Share on

You may also enjoy

Big, Linked Geospatial Data and Its Application in Earth Observation

Employment relations: a data driven analysis of job markets using online job boards and online professional networks

Suffix array

Dynamic feature scaling for online learning of binary classifiers