Tuple Sketch
Tuple sketches extend Theta sketches: they provide the same distinct-count estimates while also
keeping an arbitrary summary for each retained key (for example, a count for every key).
Using a tuple_sketch requires a TuplePolicy, which defines how summaries are created, updated,
merged, or intersected. The library provides a few basic example TuplePolicy implementations,
but a well-chosen custom summary and policy can make quite complex analyses straightforward.
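The role of a policy can be sketched in plain Python without the library. The class below is a hypothetical stand-in, not a datasketches class: the method names create_summary, update_summary and __call__ are assumptions chosen for illustration, and summarize is a helper invented here that keeps one summary per distinct key in an ordinary dict rather than in a sketch.

```python
class CountingPolicy:
    """Hypothetical policy: keeps an integer total for each key."""

    def create_summary(self):
        # Summary assigned to a key the first time it is seen.
        return 0

    def update_summary(self, summary, update):
        # Fold one incoming value into the key's existing summary.
        return summary + update

    def __call__(self, summary, other):
        # Combine two summaries that belong to the same key.
        return summary + other


def summarize(stream, policy):
    """Apply the policy to (key, value) pairs, one summary per distinct key."""
    summaries = {}
    for key, value in stream:
        current = summaries.get(key, policy.create_summary())
        summaries[key] = policy.update_summary(current, value)
    return summaries


events = [("alice", 1), ("bob", 1), ("alice", 1), ("carol", 5), ("alice", 1)]
per_key = summarize(events, CountingPolicy())
print(len(per_key), per_key["alice"])  # distinct keys, then alice's total
```

Swapping the addition for max, min, or a list append changes what is tracked per key without touching the loop, which is the separation a policy object is meant to provide.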
Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.
Several Jaccard similarity measures can be computed between tuple sketches with the
tuple_jaccard_similarity class.
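For reference, the quantity being estimated is the Jaccard index, the size of the intersection divided by the size of the union. The function below computes it exactly on ordinary Python sets; it is an illustration of the definition only, not the tuple_jaccard_similarity interface, and returning 1.0 for two empty inputs is an assumed convention.

```python
def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two iterables of keys."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # two shared keys out of four distinct keys
```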
Note
Serializing and deserializing this sketch requires the use of a PyObjectSerDe.
- class tuple_sketch
An abstract base class for tuple sketches.
- DEFAULT_SEED = 9001
- get_estimate
Estimate of the distinct count of the input stream
- get_lower_bound
Returns an approximate lower bound on the estimate at the given number of standard deviations, which must be 1, 2, or 3
- get_seed_hash
Returns a hash of the seed used in the sketch
- get_upper_bound
Returns an approximate upper bound on the estimate at the given number of standard deviations, which must be 1, 2, or 3
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if sketch is in estimation mode, otherwise False
- is_ordered
Returns True if the sketch entries are sorted, otherwise False
- property num_retained
The number of items currently in the sketch
- property theta
Theta (effective sampling rate) as a fraction from 0 to 1
- property theta64
Theta as 64-bit value
- to_string
Produces a string summary of the sketch
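How theta, num_retained and the estimate relate can be seen in a small KMV-style model. This is a toy under stated assumptions, not the library's algorithm: unit_hash, toy_sketch, estimate and rough_bounds are names invented here, the hash is SHA-256 rather than the library's hash function, and the bounds use a crude normal approximation instead of the library's own bound computation.

```python
import hashlib
import math


def unit_hash(key, seed=9001):
    """Toy hash: map a key to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def toy_sketch(keys, lg_k=12):
    """Return (retained_hashes, theta) for a nominal size of 2**lg_k entries."""
    k = 1 << lg_k
    hashes = sorted({unit_hash(key) for key in keys})
    if len(hashes) <= k:
        return hashes, 1.0      # exact mode: every distinct key is retained
    theta = hashes[k]           # the (k+1)-th smallest hash becomes the threshold
    return hashes[:k], theta    # retained hashes are all strictly below theta


def estimate(retained, theta):
    """Distinct-count estimate: retained entries scaled up by the sampling rate."""
    return len(retained) / theta


def rough_bounds(retained, theta, num_std):
    """Crude bounds assuming a relative error of about 1/sqrt(retained)."""
    est = estimate(retained, theta)
    if theta == 1.0:
        return est, est         # nothing was sampled out, so the count is exact
    spread = num_std * est / math.sqrt(len(retained))
    return est - spread, est + spread


small = toy_sketch(range(100), lg_k=12)      # fewer keys than k
large = toy_sketch(range(50_000), lg_k=10)   # many more keys than k

print(estimate(*small), small[1] < 1.0)      # exact count, not in estimation mode
print(round(estimate(*large)), rough_bounds(*large, num_std=2))
```

In this model, estimation mode corresponds to theta being below 1, and the estimate is simply the number of retained entries divided by theta.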
- class update_tuple_sketch(*args, **kwargs)
- __init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None
Creates an update_tuple_sketch using the provided parameters
- Parameters:
policy (TuplePolicy) – a policy to use when updating
lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values
- compact
Returns a compacted form of the sketch, optionally sorting it
- reset
Resets the sketch to the initial empty state
- trim
Removes retained entries in excess of the nominal size k (if any)
- update
Overloaded function.
update(self, datum: int, value: object) -> None
Updates the sketch with the given integral item and summary value
update(self, datum: float, value: object) -> None
Updates the sketch with the given floating point item and summary value
update(self, datum: str, value: object) -> None
Updates the sketch with the given string item and summary value
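The interplay between update, the summary value, and the sampling rate p can be modelled in a few lines. ToyUpdateSketch below is a hypothetical stand-in written for this page, not the library class: it assumes theta starts at p, sums integer values per key, uses SHA-256 as a placeholder hash, and omits trimming to the nominal size 2**lg_k altogether.

```python
import hashlib


def unit_hash(key, seed):
    """Toy hash: map (seed, key) to a float in [0, 1). Not the library's hash."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyUpdateSketch:
    """Hypothetical model of an updatable sketch with integer-total summaries."""

    def __init__(self, p=1.0, seed=9001):
        self.theta = p        # assumption: theta starts at the sampling rate p
        self.seed = seed
        self.entries = {}     # hash of key -> summary

    def update(self, datum, value):
        h = unit_hash(datum, self.seed)
        if h >= self.theta:
            return            # key falls outside the sample; its value is dropped
        self.entries[h] = self.entries.get(h, 0) + value

    def get_estimate(self):
        return len(self.entries) / self.theta


full = ToyUpdateSketch(p=1.0)
half = ToyUpdateSketch(p=0.5)
for i in range(20_000):
    for sketch in (full, half):
        sketch.update(f"user-{i}", 1)
        sketch.update(f"user-{i}", 1)  # same key again: summary grows, count does not

print(full.get_estimate())             # exact, since nothing was sampled out
print(len(half.entries), half.get_estimate())
```

With p = 0.5 roughly half of the keys are retained, yet dividing by theta brings the estimate back near the true distinct count; the summaries of retained keys are unaffected by the sampling.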
- class compact_tuple_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe, seed: int = 9001) _datasketches.compact_tuple_sketch
Reads a bytes object and returns the corresponding compact_tuple_sketch
Non-static Methods:
- __init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) None
- __init__(self, other: _datasketches.theta_sketch, summary: object) None
Overloaded function.
__init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) -> None
Creates a compact_tuple_sketch from an existing tuple_sketch.
- Parameters:
other (tuple_sketch) – a source tuple_sketch
ordered (bool, optional) – whether the incoming sketch entries are sorted. Default True
__init__(self, other: _datasketches.theta_sketch, summary: object) -> None
Creates a compact_tuple_sketch from a theta_sketch using a fixed summary value.
- Parameters:
other (theta_sketch) – a source theta sketch
summary (object) – a summary to use for every sketch entry
- serialize
Serializes the sketch into a bytes object
- class tuple_union(*args, **kwargs)
- __init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None
Creates a tuple_union using the provided parameters
- Parameters:
policy (TuplePolicy) – a policy to use when unioning entries
lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.
- get_result
Returns the sketch corresponding to the union result
- reset
Resets the union to its initial empty state
- update
Updates the union with the given sketch
- class tuple_intersection(*args, **kwargs)
- __init__(self, policy: _datasketches.TuplePolicy, seed: int = 9001) None
Creates a tuple_intersection using the provided parameters
- Parameters:
policy (TuplePolicy) – a policy to use when intersecting entries
seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds
- get_result
Returns the sketch corresponding to the intersection result
- has_result
Returns True if the intersection has a valid result, otherwise False
- update
Intersects the provided sketch with the current intersection state
- class tuple_a_not_b(*args, **kwargs)
- __init__(self, seed: int = 9001) None
Creates a tuple_a_not_b object
- Parameters:
seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.
- compute
Returns a sketch with the result of applying the A-not-B operation on the given inputs
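The effect of the three set operations on summaries can be illustrated with plain dicts mapping keys to summaries. The functions union, intersection and a_not_b below are invented for this illustration and operate on exact key sets rather than sketches; the additive policy is likewise an assumption. Note that A-not-B only drops entries, so it needs no policy, which matches the tuple_a_not_b constructor taking only a seed.

```python
def union(a, b, policy):
    """Keys from either input; summaries of shared keys are combined."""
    out = dict(a)
    for key, summary in b.items():
        out[key] = policy(out[key], summary) if key in out else summary
    return out


def intersection(a, b, policy):
    """Only keys present in both inputs, with their summaries combined."""
    return {key: policy(a[key], b[key]) for key in a.keys() & b.keys()}


def a_not_b(a, b):
    """Keys of a that are absent from b; summaries are taken from a unchanged."""
    return {key: summary for key, summary in a.items() if key not in b}


def add(x, y):
    # Hypothetical policy: summaries are counts, so combining means adding.
    return x + y


a = {"alice": 3, "bob": 1}
b = {"bob": 4, "carol": 2}

print(union(a, b, add))
print(intersection(a, b, add))
print(a_not_b(a, b))
```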