Tuple Sketch

Tuple sketches extend Theta sketches: they provide distinct-count estimates and additionally keep an arbitrary summary for each retained key (for example, a count for every key). Using a tuple_sketch requires a TuplePolicy, which defines how summaries are created, updated, merged, or intersected. The library provides a few basic TuplePolicy implementations, and a well-chosen custom summary and policy can make quite complex analyses straightforward.
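As a rough illustration of what a policy does, the following pure-Python sketch keeps a running count per key. It does not use the library: CountPolicy is a hypothetical stand-in, its method names (create_summary, update_summary, __call__) are assumed to mirror the TuplePolicy interface, and apply_updates is a toy that retains every key instead of sampling.

```python
class CountPolicy:
    """Hypothetical policy: the summary is a running integer total."""

    def create_summary(self):
        # Summary given to a key the first time it is seen.
        return 0

    def update_summary(self, summary, update):
        # Fold a new value into the summary of an existing key.
        return summary + update

    def __call__(self, summary, update):
        # Combine two summaries for the same key (union or intersection).
        return summary + update


def apply_updates(policy, stream):
    """Toy stand-in for a sketch: keeps every key, so nothing is sampled."""
    summaries = {}
    for key, value in stream:
        if key not in summaries:
            summaries[key] = policy.create_summary()
        summaries[key] = policy.update_summary(summaries[key], value)
    return summaries


stream = [("alice", 1), ("bob", 1), ("alice", 1), ("carol", 5)]
per_key = apply_updates(CountPolicy(), stream)
print(per_key)       # {'alice': 2, 'bob': 1, 'carol': 5}
print(len(per_key))  # 3 distinct keys
```

A real tuple sketch retains only a sample of the keys, so both the distinct count and any aggregate over the summaries are estimates rather than exact values.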

Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.

Several Jaccard similarity measures can be computed between tuple sketches with the tuple_jaccard_similarity class.

Note

Serializing and deserializing this sketch requires the use of a PyObjectSerDe.
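To give a sense of what such a serde has to do, here is a standalone sketch for integer summaries. It is not derived from PyObjectSerDe: the class name IntSummarySerDe is hypothetical, and the three method names (get_size, to_bytes, from_bytes returning the value and the number of bytes consumed) are an assumption about the interface a real subclass would implement.

```python
import struct


class IntSummarySerDe:
    """Hypothetical serde: each summary is an 8-byte little-endian integer."""

    def get_size(self, item):
        # Number of bytes needed to store one summary.
        return 8

    def to_bytes(self, item):
        # Encode one summary as a signed 64-bit integer.
        return struct.pack("<q", item)

    def from_bytes(self, data, offset):
        # Decode one summary starting at `offset`; report bytes consumed.
        (value,) = struct.unpack_from("<q", data, offset)
        return (value, 8)


serde = IntSummarySerDe()
blob = serde.to_bytes(3) + serde.to_bytes(-7)
first, used = serde.from_bytes(blob, 0)
second, _ = serde.from_bytes(blob, used)
print(first, second)  # 3 -7
```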

class tuple_sketch

An abstract base class for tuple sketches.

DEFAULT_SEED = 9001

get_estimate: Estimate of the distinct count of the input stream

get_lower_bound: Returns an approximate lower bound on the estimate at standard deviations in {1, 2, 3}

get_seed_hash: Returns a hash of the seed used in the sketch

get_upper_bound: Returns an approximate upper bound on the estimate at standard deviations in {1, 2, 3}

is_empty: Returns True if the sketch is empty, otherwise False

is_estimation_mode: Returns True if sketch is in estimation mode, otherwise False

is_ordered: Returns True if the sketch entries are sorted, otherwise False

property num_retained: The number of items currently in the sketch

property theta: Theta (effective sampling rate) as a fraction from 0 to 1

property theta64: Theta as 64-bit value
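The relationship between theta64, theta, num_retained, and the estimate can be sketched with plain arithmetic. This assumes that the largest signed 64-bit value represents a sampling rate of 1.0 and that the estimate is the retained count divided by theta; the helper names are hypothetical and not part of the library.

```python
# Assumed: the maximum signed 64-bit value corresponds to theta == 1.0.
MAX_THETA64 = 2**63 - 1


def theta_fraction(theta64):
    """Convert the 64-bit theta into a fraction between 0 and 1."""
    return theta64 / MAX_THETA64


def estimate(num_retained, theta64):
    """Scale the retained count up by the effective sampling rate."""
    return num_retained / theta_fraction(theta64)


# A sketch that has not started sampling counts exactly.
exact = estimate(100, MAX_THETA64)

# A sketch retaining 4096 entries at a sampling rate of about 1/4
# represents roughly four times as many distinct keys.
sampled = estimate(4096, MAX_THETA64 // 4)
print(exact, round(sampled))
```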

to_string: Produces a string summary of the sketch

class update_tuple_sketch

__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) → None

Creates an update_tuple_sketch using the provided parameters

Parameters:

policy (TuplePolicy) – a policy to use when updating
lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values. Default 9001

compact: Returns a compacted form of the sketch, optionally sorting it

reset: Resets the sketch to the initial empty state

trim: Removes retained entries in excess of the nominal size k (if any)

update

Overloaded function.

update(self, datum: int, value: object) -> None

Updates the sketch with the given integral item and summary value

update(self, datum: float, value: object) -> None

Updates the sketch with the given floating point item and summary value

update(self, datum: str, value: object) -> None

Updates the sketch with the given string item and summary value

class compact_tuple_sketch

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe, seed: int = 9001) → _datasketches.compact_tuple_sketch: Reads a bytes object and returns the corresponding compact_tuple_sketch

Non-static Methods:

__init__

Overloaded function.

__init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) -> None

Creates a compact_tuple_sketch from an existing tuple_sketch.

Parameters:

other (tuple_sketch) – a source tuple_sketch
ordered (bool, optional) – whether the incoming sketch entries are sorted. Default True

__init__(self, other: _datasketches.theta_sketch, summary: object) -> None

Creates a compact_tuple_sketch from a theta_sketch using a fixed summary value.

Parameters:

other (theta_sketch) – a source theta sketch
summary (object) – a summary to use for every sketch entry

serialize: Serializes the sketch into a bytes object

class tuple_union

__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) → None

Creates a tuple_union using the provided parameters

Parameters:

policy (TuplePolicy) – a policy to use when unioning entries
lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.
p (float, optional) – an initial sampling rate to use. Default 1.0
seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.

get_result: Returns the sketch corresponding to the union result

reset: Resets the sketch to the initial empty

update: Updates the union with the given sketch

class tuple_intersection

__init__(self, policy: _datasketches.TuplePolicy, seed: int = 9001) → None

Creates a tuple_intersection using the provided parameters

Parameters:

policy (TuplePolicy) – a policy to use when intersecting entries
seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds

get_result: Returns the sketch corresponding to the intersection result

has_result: Returns True if the intersection has a valid result, otherwise False

update: Intersects the provided sketch with the current intersection state

class tuple_a_not_b

__init__(self, seed: int = 9001) → None

Creates a tuple_a_not_b object

Parameters:

seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.

compute: Returns a sketch with the result of applying the A-not-B operation on the given inputs