Theta Sketch
Theta sketches are used for distinct counting.
The theta package contains the basic sketch classes that are members of the Theta Sketch Framework. There is a separate Tuple package for many of the sketches that are derived from the same algorithms defined in the Theta Sketch Framework paper.
The theta sketch is a space-efficient method for estimating the cardinality of a set. It also handles set operations (union, intersection, difference) while maintaining good accuracy. The theta sketch is a practical variant of the K-Minimum Values (KMV) sketch that avoids sorting the stored hash values on every insertion. For set operations beyond a simple union, it has better error properties than the HyperLogLog sketch.
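To make the idea concrete, here is a toy pure-Python illustration of the KMV-style approach described above. It is not the library's implementation: the class name `ToyThetaSketch`, the use of SHA-256 as the hash, and the policy of rebuilding only when the table reaches 2k entries are all assumptions chosen for readability.

```python
import hashlib

MAX_THETA = 2**64  # theta is kept as a 64-bit threshold; MAX_THETA means "keep everything"


class ToyThetaSketch:
    """Illustrative only: retains the hash values that fall below a threshold theta."""

    def __init__(self, k=1024):
        self.k = k                # nominal number of retained hashes
        self.theta = MAX_THETA    # current sampling threshold
        self.hashes = set()       # retained 64-bit hash values, all < theta

    @staticmethod
    def _hash(item):
        # Hypothetical hash choice: first 8 bytes of SHA-256 as an unsigned integer.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        h = self._hash(item)
        if h >= self.theta:
            return                # outside the sampled range, ignore
        self.hashes.add(h)        # duplicates collapse in the set
        if len(self.hashes) >= 2 * self.k:
            self._rebuild()       # sort only occasionally, not on every insert

    def _rebuild(self):
        ordered = sorted(self.hashes)
        self.theta = ordered[self.k]          # (k+1)-th smallest hash becomes the threshold
        self.hashes = set(ordered[:self.k])   # keep the k smallest hashes

    def estimate(self):
        # retained count divided by the sampling fraction theta / 2**64
        return len(self.hashes) * MAX_THETA / self.theta


toy = ToyThetaSketch(k=1024)
for i in range(50_000):
    toy.update(i)
    toy.update(i)  # feeding the same item twice does not change the estimate
toy_estimate = toy.estimate()
print(f"toy estimate for 50000 distinct items: {toy_estimate:.0f}")
```

While fewer than 2k distinct items have been seen, theta stays at its maximum and the toy sketch returns an exact count; after that, the estimate is the retained count scaled up by the sampling fraction.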
Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.
Several Jaccard similarity measures can be computed between theta sketches with the theta_jaccard_similarity class.
- class theta_sketch
An abstract base class for theta sketches
- get_estimate
Estimate of the distinct count of the input stream
- get_lower_bound
Returns an approximate lower bound on the estimate for a given number of standard deviations in {1, 2, 3}
- get_seed_hash
Returns a hash of the seed used in the sketch
- get_upper_bound
Returns an approximate upper bound on the estimate for a given number of standard deviations in {1, 2, 3}
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if sketch is in estimation mode, otherwise False
- is_ordered
Returns True if the sketch entries are sorted, otherwise False
- property num_retained
The number of items currently in the sketch
- property theta
Theta (effective sampling rate) as a fraction from 0 to 1
- property theta64
Theta as 64-bit value
- to_string
Produces a string summary of the sketch
- class update_theta_sketch(*args, **kwargs)
- __init__(self, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None
Creates an update_theta_sketch using the provided parameters
- Parameters:
lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values. Default 9001.
- compact
Returns a compacted form of the sketch, optionally sorting it
- reset
Resets the sketch to the initial empty state
- trim
Removes retained entries in excess of the nominal size k (if any)
- update
Overloaded function.
update(self, datum: int) -> None
Updates the sketch with the given integral value
update(self, datum: float) -> None
Updates the sketch with the given floating point value
update(self, datum: str) -> None
Updates the sketch with the given string
- class compact_theta_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, seed: int = 9001) _datasketches.compact_theta_sketch
Reads a bytes object and returns the corresponding compact_theta_sketch
Non-static Methods:
- __init__(self, arg0: _datasketches.theta_sketch, arg1: bool, /) None
Creates a compact_theta_sketch from an existing theta_sketch.
- Parameters:
other (theta_sketch) – a source theta_sketch
ordered (bool) – whether the incoming sketch entries are sorted. Default True
- serialize
Serializes the sketch into a bytes object
- class theta_union(*args, **kwargs)
- __init__(self, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None
Creates a theta_union using the provided parameters
- Parameters:
lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds.
- get_result
Returns the sketch corresponding to the union result
- update
Updates the union with the given sketch
- class theta_intersection(*args, **kwargs)
- __init__(self, seed: int = 9001) None
Creates a theta_intersection using the provided parameters
- Parameters:
seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds
- get_result
Returns the sketch corresponding to the intersection result
- has_result
Returns True if the intersection has a valid result, otherwise False
- update
Intersects the provided sketch with the current intersection state
- class theta_a_not_b(*args, **kwargs)
- __init__(self, seed: int = 9001) None
Creates a theta_a_not_b object
- Parameters:
seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds.
- compute
Returns a sketch with the result of applying the A-not-B operation on the given inputs