# It Probably Works

McMullen, T. (2015). It Probably Works. Commun. ACM, 58(11), 50–54. http://doi.org/10.1145/2814332

## Introduction

This article distinguishes between three types of probabilistic algorithms:

- algorithms which explicitly introduce randomness through a `rand()` call,
- approaches that convert input data to a uniform distribution, and
- algorithms where randomness is intrinsic to the data (e.g. sensor networks that only cover a fraction of the environment, applications where processing power limits the fraction of data that can be analyzed, etc.).

## Probabilistic algorithms

The following sections introduce probabilistic algorithms that solve different standard tasks.

### Count-Distinct

Count-distinct determines the cardinality of a multiset (i.e. the number of unique items within a stream). **HyperLogLog** is a probabilistic algorithm that addresses this problem by leveraging the following observations:

- the probability of a certain bit being set in a binary representation is 0.5,
- the probability of two independent events $$A$$ and $$B$$ both occurring is $$P(A) \cdot P(B)$$, and
- the expected number of trials until an event $$A$$ occurs is $$1/P(A)$$.

Since these observations only hold for uniformly distributed bits, a *uniform* distribution is required; it is obtained by hashing the input items.
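The observations above can be checked empirically; a small simulation (with made-up parameters `k` and `trials`) shows that a run of `k` leading set bits appears on average once every $$2^k$$ uniformly random values:

```python
import random

def leading_set_bits(x, width=32):
    """Length of the run of 1-bits at the top of x's binary form."""
    run = 0
    for i in range(width - 1, -1, -1):
        if x & (1 << i):
            run += 1
        else:
            break
    return run

random.seed(42)
k = 8
trials = 200_000
# The top k bits are all set with probability (1/2)**k, so we expect
# one hit roughly every 2**k = 256 values.
hits = sum(1 for _ in range(trials)
           if leading_set_bits(random.getrandbits(32)) >= k)
print(trials / hits)  # close to 2**k = 256
```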

**Workflow:**

- hash input items
- count the number of leading bits that are set and keep track of the maximum run of leading set bits observed so far (*max_bits*)
- the estimated number of distinct items is then $$n \approx 2^{\text{max\_bits}}$$

To reduce the variance of this estimate, the input is distributed over $$m$$ baskets, and the mean *max_bits* in these baskets is a good estimate for $$\log_2(n/m)$$.
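A minimal LogLog-style sketch of this scheme (the bucket count, SHA-1 hashing, and the asymptotic bias-correction constant ≈ 0.39701 from Durand and Flajolet's LogLog are implementation choices, not prescribed by the article):

```python
import hashlib

def estimate_cardinality(items, b=10):
    """LogLog-style estimate: 2**b buckets, each tracking the longest
    run of set bits seen in its items' hashes (for uniform bits, the
    run's position within the hash does not matter)."""
    m = 2 ** b
    max_bits = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        bucket = h & (m - 1)   # low b bits pick the bucket
        rest, run = h >> b, 1  # rank of the first unset bit
        while rest & 1:
            rest >>= 1
            run += 1
        if run > max_bits[bucket]:
            max_bits[bucket] = run
    # Each bucket sees ~n/m items, so its max_bits estimates
    # log2(n/m); average over buckets, scale back by m, and apply
    # the LogLog bias correction.
    avg = sum(max_bits) / m
    return 0.39701 * m * 2 ** avg

print(estimate_cardinality(range(50_000)))  # roughly 50,000
```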

**Accuracy:**

For a usage scenario of 1,000 servers with 100 million unique items per server, the HyperLogLog algorithm provides an estimate with a 2% error based on only 0.0002% of the total data.

### Nearest-Neighbor Search

This algorithm identifies similar items, i.e. items that are close according to a distance function such as cosine similarity. In text mining, the relevant features for nearest-neighbor search are usually high-dimensional word vectors that suffer from the **curse of dimensionality**: they contain many sparse features, and computing distances between vectors with many dimensions is time consuming. Locality-sensitive hashing (LSH) addresses this problem by generating hashes that are similar for similar documents.

**Workflow:**
For an LSH with *n* bits and word vectors of size *m* we

- generate *n* random *m*-dimensional vectors
- hash a document by multiplying its word vector with each of the *n* random vectors (each of the *n* random vectors corresponds to one bit of the LSH)
- if the dot product produces a negative number, set the corresponding bit to 0, otherwise to 1.
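The workflow above can be sketched with random hyperplanes; the dimensions, bit count, and example vectors below are made up for illustration:

```python
import random

def lsh_signature(vec, planes):
    """One bit per random hyperplane: the sign of the dot product."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

random.seed(0)
m, n = 5, 16  # word-vector dimension, signature bits
planes = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

doc_a = [1.0, 2.0, 0.0, 0.5, 0.0]
doc_b = [1.1, 1.9, 0.0, 0.4, 0.0]   # similar to doc_a
doc_c = [-1.0, 0.0, 3.0, 0.0, 2.0]  # dissimilar

# Similar documents should differ in fewer signature bits.
print(hamming(lsh_signature(doc_a, planes), lsh_signature(doc_b, planes)))
print(hamming(lsh_signature(doc_a, planes), lsh_signature(doc_c, planes)))
```

The probability that one bit differs equals the angle between the two vectors divided by π, so the Hamming distance between signatures approximates the cosine distance between documents.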

### Reliable Broadcast

This use case aims at broadcasting messages to a large number of peers through a high-latency, lossy network (i.e. it is not guaranteed that all messages are received). **Bimodal Multicast** is an algorithm that addresses this problem based on the following workflow:

- broadcast the message over the network (some of the messages might get lost)
- apply the following gossip protocol:
- each server contacts another randomly selected server and sends it a digest of its current state (i.e. a summary that allows the receiver to determine whether it has missed any messages).
- the receiver responds with a list of messages it has not yet received, and the gossip initiator resends these messages.
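A toy version of this digest exchange (peer count, loss rate, and message ids are illustrative; this is a sketch of the gossip repair step only, not the full Bimodal Multicast protocol):

```python
import random

class Peer:
    def __init__(self):
        self.messages = {}         # msg_id -> payload

    def digest(self):
        return set(self.messages)  # summary of known messages

    def gossip_with(self, other):
        # Send our digest; the partner reports what it is missing
        # and we resend exactly those messages.
        for msg_id in self.digest() - other.digest():
            other.messages[msg_id] = self.messages[msg_id]

random.seed(1)
peers = [Peer() for _ in range(5)]

# Lossy broadcast: each peer receives each of 10 messages with p = 0.7.
for msg_id in range(10):
    for peer in peers:
        if random.random() < 0.7:
            peer.messages[msg_id] = f"payload-{msg_id}"

union = set().union(*(p.digest() for p in peers))

# Gossip rounds: every peer syncs with a random partner until all
# peers hold every message that survived the initial broadcast.
rounds = 0
while any(p.digest() != union for p in peers) and rounds < 50:
    for peer in peers:
        partner = random.choice([p for p in peers if p is not peer])
        peer.gossip_with(partner)
    rounds += 1

print(all(p.digest() == union for p in peers))
```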

### Other algorithms

- Consistent hashing, which is often used for load balancing, minimizes the necessity to remap values if the size of the hash table is changed.
- Bloom filters determine whether an entity is definitely not part of a set or *might be* included in that set.
- the Mincut algorithm addresses routing problems in networks.
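The Bloom filter behaviour, for instance, can be sketched as follows (array size, number of hashes, and the SHA-1-based hashing scheme are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    """k hash positions per item in an m-bit array: membership tests
    answer "definitely not in the set" or "might be in the set"."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # A single unset bit proves the item was never added.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("apple")
bf.add("banana")
print(bf.might_contain("apple"))   # True: no false negatives
print(bf.might_contain("durian"))  # almost surely False
```

False negatives are impossible, while the false-positive rate grows with the number of set bits and can be tuned via `m` and `k`.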