# It Probably Works

McMullen, T. (2015). It Probably Works. Commun. ACM, 58(11), 50–54. http://doi.org/10.1145/2814332

## Introduction

This article distinguishes between three types of probabilistic algorithms:

- algorithms which explicitly introduce randomness through a `rand()` call,
- approaches that convert input data to a uniform distribution, and
- algorithms where randomness is intrinsic to the data (e.g. sensor networks that only cover a fraction of the environment, applications where processing power limits the fraction of data that can be analyzed, etc.).

## Probabilistic algorithms

The following sections introduce probabilistic algorithms that solve different standard tasks.

### Count-Distinct

Count-distinct determines the cardinality of a multiset (i.e. the number of unique items within a stream). **HyperLogLog** is a probabilistic algorithm that addresses this problem by leveraging the following observations:

- the probability of a certain bit being set in a binary representation is 0.5,
- the probability of two independent events $$A$$ and $$B$$ both occurring is $$P(A) \cdot P(B)$$, and
- the expected number of trials until an event $$A$$ occurs is $$1/P(A)$$.

Since these observations only hold for uniformly distributed bits, a *uniform* distribution is required; it is obtained by hashing the input items.
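The observations above can be checked empirically; a small simulation (with made-up parameters `k` and `trials`) shows that a run of `k` leading set bits appears on average once every $$2^k$$ uniformly random values:

```python
import random

def leading_set_bits(x, width=32):
    """Length of the run of 1-bits at the top of x's binary form."""
    run = 0
    for i in range(width - 1, -1, -1):
        if x & (1 << i):
            run += 1
        else:
            break
    return run

random.seed(42)
k = 8
trials = 200_000
# The top k bits are all set with probability (1/2)**k, so we expect
# one hit roughly every 2**k = 256 values.
hits = sum(1 for _ in range(trials)
           if leading_set_bits(random.getrandbits(32)) >= k)
print(trials / hits)  # close to 2**k = 256
```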

**Workflow:**

- hash input items
- count the number of leading bits that are set and keep track of the maximum run of leading set bits observed so far (*max_bits*)
- the estimated number of distinct items is then $$n \approx 2^{\text{max\_bits}}$$

To reduce the variance of this estimate, the input is distributed over $$m$$ baskets, and the mean *max_bits* in these baskets is a good estimate for $$\log_2(n/m)$$.
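A minimal LogLog-style sketch of this scheme (the bucket count, SHA-1 hashing, and the asymptotic bias-correction constant ≈ 0.39701 from Durand and Flajolet's LogLog are implementation choices, not prescribed by the article):

```python
import hashlib

def estimate_cardinality(items, b=10):
    """LogLog-style estimate: 2**b buckets, each tracking the longest
    run of set bits seen in its items' hashes (for uniform bits, the
    run's position within the hash does not matter)."""
    m = 2 ** b
    max_bits = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        bucket = h & (m - 1)   # low b bits pick the bucket
        rest, run = h >> b, 1  # rank of the first unset bit
        while rest & 1:
            rest >>= 1
            run += 1
        if run > max_bits[bucket]:
            max_bits[bucket] = run
    # Each bucket sees ~n/m items, so its max_bits estimates
    # log2(n/m); average over buckets, scale back by m, and apply
    # the LogLog bias correction.
    avg = sum(max_bits) / m
    return 0.39701 * m * 2 ** avg

print(estimate_cardinality(range(50_000)))  # roughly 50,000
```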

**Accuracy:**

For a usage scenario of 1,000 servers with 100 million unique items per server, the HyperLogLog algorithm provides an estimate with a 2% error based on only 0.0002% of the total data.

### Nearest-Neighbor Search

This algorithm identifies similar items, i.e. items that are close according to a distance function such as cosine similarity. In text mining, the relevant features for nearest-neighbor search are usually high-dimensional word vectors that suffer from the **curse of dimensionality**: they contain many sparse features, and computing distances between vectors with many dimensions is time consuming. Locality-sensitive hashing (LSH) addresses this problem by generating hashes that are similar for similar documents.

**Workflow:**
For an LSH with *n* bits and word vectors of size *m* we

- generate *n* random *m*-dimensional vectors
- hash a document by multiplying its word vector with each of the *n* random vectors (each of the *n* random vectors corresponds to one bit of the LSH)
- if the dot product produces a negative number, set the corresponding bit to 0, otherwise to 1.
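The workflow above can be sketched with random hyperplanes; the dimensions, bit count, and example vectors below are made up for illustration:

```python
import random

def lsh_signature(vec, planes):
    """One bit per random hyperplane: the sign of the dot product."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

random.seed(0)
m, n = 5, 16  # word-vector dimension, signature bits
planes = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

doc_a = [1.0, 2.0, 0.0, 0.5, 0.0]
doc_b = [1.1, 1.9, 0.0, 0.4, 0.0]   # similar to doc_a
doc_c = [-1.0, 0.0, 3.0, 0.0, 2.0]  # dissimilar

# Similar documents should differ in fewer signature bits.
print(hamming(lsh_signature(doc_a, planes), lsh_signature(doc_b, planes)))
print(hamming(lsh_signature(doc_a, planes), lsh_signature(doc_c, planes)))
```

The probability that one bit differs equals the angle between the two vectors divided by π, so the Hamming distance between signatures approximates the cosine distance between documents.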

### Reliable Broadcast

This use case aims at broadcasting messages to a large number of peers through a high-latency, lossy network (i.e. it is not guaranteed that all messages are received). **Bimodal Multicast** is an algorithm that addresses this problem based on the following workflow:

- broadcast the message over the network (some of the messages might get lost)
- apply the following gossip protocol:
- each server contacts another randomly selected server and sends it a digest of its current state (i.e. a summary that allows the receiver to determine whether it has missed any messages).
- the receiver responds with a list of messages it has not yet received, and the gossip initiator resends these messages.
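A toy version of this digest exchange (peer count, loss rate, and message ids are illustrative; this is a sketch of the gossip repair step only, not the full Bimodal Multicast protocol):

```python
import random

class Peer:
    def __init__(self):
        self.messages = {}         # msg_id -> payload

    def digest(self):
        return set(self.messages)  # summary of known messages

    def gossip_with(self, other):
        # Send our digest; the partner reports what it is missing
        # and we resend exactly those messages.
        for msg_id in self.digest() - other.digest():
            other.messages[msg_id] = self.messages[msg_id]

random.seed(1)
peers = [Peer() for _ in range(5)]

# Lossy broadcast: each peer receives each of 10 messages with p = 0.7.
for msg_id in range(10):
    for peer in peers:
        if random.random() < 0.7:
            peer.messages[msg_id] = f"payload-{msg_id}"

union = set().union(*(p.digest() for p in peers))

# Gossip rounds: every peer syncs with a random partner until all
# peers hold every message that survived the initial broadcast.
rounds = 0
while any(p.digest() != union for p in peers) and rounds < 50:
    for peer in peers:
        partner = random.choice([p for p in peers if p is not peer])
        peer.gossip_with(partner)
    rounds += 1

print(all(p.digest() == union for p in peers))
```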

### Other algorithms

- Consistent hashing, which is often used for load balancing, minimizes the necessity to remap values if the size of the hash table is changed.
- Bloom filters determine whether an entity is definitely not part of a set or *might be* included in that set.
- the Mincut algorithm addresses routing problems in networks.
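The Bloom filter behaviour, for instance, can be sketched as follows (array size, number of hashes, and the SHA-1-based hashing scheme are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    """k hash positions per item in an m-bit array: membership tests
    answer "definitely not in the set" or "might be in the set"."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # A single unset bit proves the item was never added.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("apple")
bf.add("banana")
print(bf.might_contain("apple"))   # True: no false negatives
print(bf.might_contain("durian"))  # almost surely False
```

False negatives are impossible, while the false-positive rate grows with the number of set bits and can be tuned via `m` and `k`.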