HyperLogLog (HLL)

This is a high performance implementation of Phillipe Flajolet’s HLL sketch but with significantly improved error behavior.

If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.

This implementation offers three different types of HLL sketch, each with different trade-offs with accuracy, space and performance. These types are specified with the target_hll_type parameter.

In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of n, the number of unique values fed to the sketch. The configuration parameter lg_config_k is the log-base-2 of k, where k is the number of buckets or slots for the sketch.

During warmup, when the sketch has only received a small number of unique items (up to about 10% of k), this implementation leverages a new class of estimator algorithms with significantly better accuracy.

class tgt_hll_type

Target HLL flavor

HLL_4 : 4 bits per entry
HLL_6 : 6 bits per entry
HLL_8 : 8 bits per entry
class hll_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) _datasketches.hll_sketch

Reads a bytes object and returns the corresponding hll_sketch

get_max_updatable_serialization_bytes(lg_k: int, tgt_type: _datasketches.tgt_hll_type) int

Provides a likely upper bound on serialization size for the given parameters

get_rel_err(upper_bound: bool, unioned: bool, lg_k: int, num_std_devs: int) float

Returns the a priori relative error bound for the given parameters

Non-static Methods:

__init__(self, lg_k: int, tgt_type: _datasketches.tgt_hll_type = _datasketches.tgt_hll_type.HLL_8, start_max_size: bool = False) None

Constructs a new HLL sketch

Parameters:
  • lg_config_k (int) – A full sketch can hold 2^lg_config_k rows. Must be between 7 and 21, inclusive,

  • tgt_type (tgt_hll_type) – The HLL mode to use, if/when the sketch reaches estimation mode

  • start_full_size (bool) – Indicates whether to start in HLL mode, keeping memory use constant (if HLL_6 or HLL_8) at the cost of much higher initial memory use. Default (and recommended) is False.

get_compact_serialization_bytes

Returns the size of the serialized sketch when compressing the exception table if HLL_4

get_estimate

Estimate of the distinct count of the input stream

get_lower_bound

Returns the approximate lower error bound given the specified number of standard deviations in {1, 2, 3}

get_updatable_serialization_bytes

Returns the size of the serialized sketch

get_upper_bound

Returns the approximate upper error bound given the specified number of standard deviations in {1, 2, 3}

is_compact

True if the sketch is compact, otherwise False

is_empty

True if the sketch is empty, otherwise False

property lg_config_k

Configured lg_k value for the sketch

reset

Resets the sketch to the empty state in coupon collection mode

serialize_compact

Serializes the sketch into a bytes object, compressing the exception table if HLL_4

serialize_updatable

Serializes the sketch into a bytes object

property tgt_type

The HLL type (4, 6, or 8) when in estimation mode

to_string

Produces a string summary of the sketch

update

Overloaded function.

  1. update(self, datum: int) -> None

Updates the sketch with the given integral value

  1. update(self, datum: float) -> None

Updates the sketch with the given floating point value

  1. update(self, datum: str) -> None

Updates the sketch with the given string value

class hll_union(*args, **kwargs)

Static Methods:

get_rel_err(upper_bound: bool, unioned: bool, lg_k: int, num_std_devs: int) float

Returns the a priori relative error bound for the given parameters

Non-static Methods:

__init__(self, lg_max_k: int) None

Construct an hll_union object if the given size.

Parameters:

lg_max_k (int) – The maximum size, in log2, of k. Must be between 7 and 21, inclusive.

get_estimate

Estimate of the distinct count of the input stream

get_lower_bound

Returns the approximate lower error bound given the specified number of standard deviations in {1, 2, 3}

get_result

Returns a sketch of the target type representing the current union state

get_upper_bound

Returns the approximate upper error bound given the specified number of standard deviations in {1, 2, 3}

is_empty

True if the union is empty, otherwise False

property lg_config_k

Configured lg_k value for the union

reset

Resets the union to the empty state

update

Overloaded function.

  1. update(self, sketch: _datasketches.hll_sketch) -> None

Updates the union with the given HLL sketch

  1. update(self, datum: int) -> None

Updates the union with the given integral value

  1. update(self, datum: float) -> None

Updates the union with the given floating point value

  1. update(self, datum: str) -> None

Updates the union with the given string value