HyperLogLog (HLL)
This is a high performance implementation of Phillipe Flajolet’s HLL sketch but with significantly improved error behavior.
If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.
This implementation offers three different types of HLL sketch, each with different trade-offs with accuracy, space and performance. These types are specified with the target_hll_type parameter.
In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of n
, the number of unique values fed to the sketch.
The configuration parameter lg_config_k
is the log-base-2 of k
, where k
is the number of buckets or slots for the sketch.
During warmup, when the sketch has only received a small number of unique items (up to about 10% of k
), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
- class tgt_hll_type
Target HLL flavor
- HLL_4 : 4 bits per entry
- HLL_6 : 6 bits per entry
- HLL_8 : 8 bits per entry
- class hll_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.hll_sketch
Reads a bytes object and returns the corresponding hll_sketch
- get_max_updatable_serialization_bytes(lg_k: int, tgt_type: _datasketches.tgt_hll_type) int
Provides a likely upper bound on serialization size for the given parameters
- get_rel_err(upper_bound: bool, unioned: bool, lg_k: int, num_std_devs: int) float
Returns the a priori relative error bound for the given parameters
Non-static Methods:
- __init__(self, lg_k: int, tgt_type: _datasketches.tgt_hll_type = _datasketches.tgt_hll_type.HLL_8, start_max_size: bool = False) None
Constructs a new HLL sketch
- Parameters:
lg_config_k (int) – A full sketch can hold 2^lg_config_k rows. Must be between 7 and 21, inclusive,
tgt_type (tgt_hll_type) – The HLL mode to use, if/when the sketch reaches estimation mode
start_full_size (bool) – Indicates whether to start in HLL mode, keeping memory use constant (if HLL_6 or HLL_8) at the cost of much higher initial memory use. Default (and recommended) is False.
- get_compact_serialization_bytes
Returns the size of the serialized sketch when compressing the exception table if HLL_4
- get_estimate
Estimate of the distinct count of the input stream
- get_lower_bound
Returns the approximate lower error bound given the specified number of standard deviations in {1, 2, 3}
- get_updatable_serialization_bytes
Returns the size of the serialized sketch
- get_upper_bound
Returns the approximate upper error bound given the specified number of standard deviations in {1, 2, 3}
- is_compact
True if the sketch is compact, otherwise False
- is_empty
True if the sketch is empty, otherwise False
- property lg_config_k
Configured lg_k value for the sketch
- reset
Resets the sketch to the empty state in coupon collection mode
- serialize_compact
Serializes the sketch into a bytes object, compressing the exception table if HLL_4
- serialize_updatable
Serializes the sketch into a bytes object
- property tgt_type
The HLL type (4, 6, or 8) when in estimation mode
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, datum: int) -> None
Updates the sketch with the given integral value
update(self, datum: float) -> None
Updates the sketch with the given floating point value
update(self, datum: str) -> None
Updates the sketch with the given string value
- class hll_union(*args, **kwargs)
Static Methods:
- get_rel_err(upper_bound: bool, unioned: bool, lg_k: int, num_std_devs: int) float
Returns the a priori relative error bound for the given parameters
Non-static Methods:
- __init__(self, lg_max_k: int) None
Construct an hll_union object if the given size.
- Parameters:
lg_max_k (int) – The maximum size, in log2, of k. Must be between 7 and 21, inclusive.
- get_estimate
Estimate of the distinct count of the input stream
- get_lower_bound
Returns the approximate lower error bound given the specified number of standard deviations in {1, 2, 3}
- get_result
Returns a sketch of the target type representing the current union state
- get_upper_bound
Returns the approximate upper error bound given the specified number of standard deviations in {1, 2, 3}
- is_empty
True if the union is empty, otherwise False
- property lg_config_k
Configured lg_k value for the union
- reset
Resets the union to the empty state
- update
Overloaded function.
update(self, sketch: _datasketches.hll_sketch) -> None
Updates the union with the given HLL sketch
update(self, datum: int) -> None
Updates the union with the given integral value
update(self, datum: float) -> None
Updates the union with the given floating point value
update(self, datum: str) -> None
Updates the union with the given string value