Distinct Counting

Distinct counting is one of the earliest tasks to which sketches were applied. The concept is simple: Provide an estimate of the number of unique elements in a set of data. One of the earliest solutions came from Flajolet and Martin in 1985 with their seminal work Probabilistic counting Algorithms for Data Base Applications.

The DataSketches library offers several types of distinct counting sketches, each with different properties.

  • hll_sketch: Hyper Log Log, a well-known sketch for distinct counting but no longer state-of-the-art.

  • cpc_sketch: Provides a better accuracy-space trade-off than HLL, but with a somewhat larger footprint while in-memory.

  • theta_sketch: Theta sketch, a type of k-minimum value sketch, which provide good performance with intersection and set difference operations.

  • tuple_sketch: Tuple sketch, which is similar to a theta sketch but supports additional data stored with each key.