Compressed Probabilistic Counting (CPC)

High performance C++ implementation of Compressed Probabilistic Counting (CPC) Sketch. This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a FM85) algorithms developed by Kevin Lang in his paper Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm. This sketch is extremely space-efficient when serialized. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, another well-known estimator, the Historical Inverse Probability (HIP) estimator does not. The update speed performance of this sketch is quite fast and is comparable to the speed of HLL. The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K. For additional security this sketch can be configured with a user-specified hash seed.

class cpc_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) _datasketches.cpc_sketch

Reads a bytes object and returns the corresponding cpc_sketch

Non-static Methods:

__init__(self, lg_k: int = 11, seed: int = 9001) None

Creates a new CPC sketch

Parameters:
  • lg_k (int, optional) – base 2 logarithm of the number of bins in the sketch

  • seed (int, optional) – seed value for the hash function

get_estimate

Estimate of the distinct count of the input stream

get_lower_bound

Returns an approximate lower bound on the estimate for kappa values in {1, 2, 3}, roughly corresponding to standard deviations

get_upper_bound

Returns an approximate upper bound on the estimate for kappa values in {1, 2, 3}, roughly corresponding to standard deviations

is_empty

Returns True if the sketch is empty, otherwise False

property lg_k

Configured lg_k of this sketch

serialize

Serializes the sketch into a bytes object

to_string

Produces a string summary of the sketch

update

Overloaded function.

  1. update(self, datum: int) -> None

Updates the sketch with the given 64-bit integer value

  1. update(self, datum: float) -> None

Updates the sketch with the given 64-bit floating point

  1. update(self, datum: str) -> None

Updates the sketch with the given string

class cpc_union(*args, **kwargs)
__init__(self, lg_k: int, seed: int = 9001) None
get_result

Returns a CPC sketch with the result of the union

update

Updates the union with the provided CPC sketch