Tuple Sketch

Tuple sketches extend Theta sketches: they provide estimates of distinct counts and also keep an arbitrary summary associated with each retained key (for example, a count for every key). Using a tuple_sketch requires a TuplePolicy, which defines how summaries are created, updated, merged, or intersected. The library provides a few basic TuplePolicy implementations, and a well-chosen custom summary and policy can make fairly complex analyses straightforward.
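
To make the role of a policy concrete, the following is a minimal pure-Python model of the idea. It does not use the datasketches library: the class CountPolicy, its method names, and the plain dictionary standing in for the retained entries are all hypothetical and chosen only for illustration.

```python
class CountPolicy:
    """Toy policy (hypothetical): the summary for each key is a running total."""

    def create_summary(self):
        # Starting summary for a key seen for the first time.
        return 0

    def update_summary(self, summary, value):
        # Fold one new input value into an existing summary.
        return summary + value

    def merge(self, summary_a, summary_b):
        # Combine the summaries held for the same key by two sketches.
        return summary_a + summary_b


policy = CountPolicy()
summaries = {}  # key -> summary, standing in for the retained entries
for key, value in [("a", 1), ("b", 1), ("a", 1)]:
    current = summaries.get(key, policy.create_summary())
    summaries[key] = policy.update_summary(current, value)

print(summaries)  # {'a': 2, 'b': 1}
```

A real sketch keeps only a sample of the keys, so summaries exist only for the retained entries rather than for every key as in this toy dictionary.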

Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.

Several Jaccard similarity measures can be computed between tuple sketches with the tuple_jaccard_similarity class.
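
As a reminder of the quantity being estimated, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it exactly on plain Python sets; it is not the library's estimator, and the function name jaccard_index is made up for this example.

```python
def jaccard_index(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B| for two plain sets."""
    union = a | b
    if not union:
        # Convention chosen for this example: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(jaccard_index(a, b))  # 2 shared items out of 6 distinct items
```

A sketch-based measure works on sampled entries instead of full sets, so it returns an estimate rather than this exact value.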


Serializing and deserializing this sketch requires the use of a PyObjectSerDe.
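
The need for a serializer follows from summaries being arbitrary Python objects: something has to turn each one into bytes and back. The sketch below models that round trip for integer summaries with the standard struct module. ToyIntSerDe is a hypothetical stand-alone class, not a subclass of PyObjectSerDe, and its method names are assumptions that should be checked against the PyObjectSerDe documentation.

```python
import struct


class ToyIntSerDe:
    """Toy serializer for integer summaries (hypothetical, not the library class)."""

    def to_bytes(self, item: int) -> bytes:
        # Encode one summary as a little-endian signed 64-bit integer.
        return struct.pack("<q", item)

    def from_bytes(self, data: bytes, offset: int):
        # Decode one summary starting at offset; return it with the bytes consumed.
        (value,) = struct.unpack_from("<q", data, offset)
        return value, 8


serde = ToyIntSerDe()
summaries = [3, -1, 42]
blob = b"".join(serde.to_bytes(s) for s in summaries)

decoded, offset = [], 0
while offset < len(blob):
    value, used = serde.from_bytes(blob, offset)
    decoded.append(value)
    offset += used

print(decoded)  # [3, -1, 42]
```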

class tuple_sketch

An abstract base class for tuple sketches.


Estimate of the distinct count of the input stream


Returns an approximate lower bound on the estimate for a given number of standard deviations in {1, 2, 3}


Returns a hash of the seed used in the sketch


Returns an approximate upper bound on the estimate for a given number of standard deviations in {1, 2, 3}


Returns True if the sketch is empty, otherwise False


Returns True if sketch is in estimation mode, otherwise False


Returns True if the sketch entries are sorted, otherwise False

property num_retained

The number of items currently in the sketch

property theta

Theta (effective sampling rate) as a fraction from 0 to 1

property theta64

Theta as 64-bit value
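
The three properties above are related. The sketch below shows one way to read them together, under two assumptions not stated on this page: that the 64-bit theta is scaled so that 2^63 - 1 corresponds to a sampling rate of 1.0, and that the distinct-count estimate is the number of retained entries divided by the sampling rate, as in the Theta sketch framework. The constant and function names are illustrative only.

```python
MAX_THETA64 = (1 << 63) - 1  # assumed integer value corresponding to theta == 1.0


def theta_fraction(theta64: int) -> float:
    # Convert the 64-bit theta into a sampling rate between 0 and 1.
    return theta64 / MAX_THETA64


def estimate(num_retained: int, theta64: int) -> float:
    # Scale the retained count up by the inverse of the sampling rate.
    return num_retained / theta_fraction(theta64)


print(theta_fraction(MAX_THETA64))       # 1.0
print(estimate(100, MAX_THETA64))        # 100.0, no sampling so the count is exact
print(estimate(4096, MAX_THETA64 // 4))  # roughly 16384
```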


Produces a string summary of the sketch

class update_tuple_sketch(*args, **kwargs)
__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None

Creates an update_tuple_sketch using the provided parameters

  • policy (TuplePolicy) – a policy to use when updating

  • lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values
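
Because lg_k is a base 2 logarithm, small changes in it have a large effect on sketch size. The sketch below converts lg_k to the nominal number of entries and prints a rough accuracy figure. The 1/sqrt(k) relative error is a common rule of thumb for Theta-family sketches and is an assumption here, not a guarantee taken from this page; the function names are illustrative.

```python
import math


def nominal_entries(lg_k: int) -> int:
    # k = 2 ** lg_k, the nominal maximum number of retained entries.
    return 1 << lg_k


def rough_relative_error(lg_k: int) -> float:
    # Rule-of-thumb relative standard error, assumed to be about 1 / sqrt(k).
    return 1.0 / math.sqrt(nominal_entries(lg_k))


for lg_k in (10, 12, 14):
    print(lg_k, nominal_entries(lg_k), rough_relative_error(lg_k))
```

With the default lg_k of 12 this gives 4096 nominal entries.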


Returns a compacted form of the sketch, optionally sorting it


Produces a compact_tuple_sketch from the given sketch by applying a predicate to the summary in each entry.


predicate – A function returning True or False, evaluated on each tuple summary


A compact_tuple_sketch with the selected entries

Return type: compact_tuple_sketch
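
The predicate-based selection described above can be modeled on a plain dictionary of summaries. This is a conceptual sketch only: filter_entries is a hypothetical helper, the dictionary stands in for the retained entries, and carrying theta over unchanged to the filtered result is an assumption based on how the Theta sketch framework behaves.

```python
def filter_entries(entries: dict, predicate) -> dict:
    # Keep only the entries whose summary satisfies the predicate.
    return {key: summary for key, summary in entries.items() if predicate(summary)}


entries = {"a": 5, "b": 1, "c": 3}  # retained key -> summary (a count here)
theta = 0.5                          # assumed sampling rate of the source sketch

frequent = filter_entries(entries, lambda summary: summary >= 3)
print(frequent)               # {'a': 5, 'c': 3}
print(len(frequent) / theta)  # 4.0, estimated number of keys with count >= 3
```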



Resets the sketch to the initial empty state


Removes retained entries in excess of the nominal size k (if any)


Overloaded function.

  1. update(self, datum: int, value: object) -> None

Updates the sketch with the given integral item and summary value

  2. update(self, datum: float, value: object) -> None

Updates the sketch with the given floating point item and summary value

  3. update(self, datum: str, value: object) -> None

Updates the sketch with the given string item and summary value
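
To illustrate what an update involves conceptually, here is a small pure-Python model: hash the item, ignore it if the hash falls outside the sampled region, otherwise create or update its summary, and lower theta when the sketch is over capacity. ToyTupleSketch is hypothetical and much simpler than the library; in particular it uses blake2b from hashlib in place of the library's seeded hash, and it hard-codes a summing policy.

```python
import hashlib


class ToyTupleSketch:
    """Conceptual model of a tuple sketch with a summing policy (not the library)."""

    def __init__(self, lg_k: int = 6):
        self.k = 1 << lg_k   # nominal number of retained entries
        self.theta = 1.0     # sampling threshold as a fraction
        self.entries = {}    # hash fraction -> summary (a running sum here)

    @staticmethod
    def _hash(datum) -> float:
        # Stand-in hash: blake2b mapped to [0, 1]; a real sketch uses a seeded hash.
        digest = hashlib.blake2b(repr(datum).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def update(self, datum, value):
        h = self._hash(datum)
        if h >= self.theta:
            return  # hash falls outside the sampled region, so ignore the item
        # Create the summary on first sight, otherwise fold the value in.
        self.entries[h] = self.entries.get(h, 0) + value
        if len(self.entries) > self.k:
            # Over capacity: lower theta to the largest retained hash and drop it.
            self.theta = max(self.entries)
            del self.entries[self.theta]

    def estimate(self) -> float:
        # Retained count scaled up by the inverse of the sampling rate.
        return len(self.entries) / self.theta


sketch = ToyTupleSketch(lg_k=6)
for _ in range(2):              # feed every key twice
    for key in range(10_000):
        sketch.update(key, 1)

print(len(sketch.entries), sketch.theta, sketch.estimate())
```

Every retained summary ends up equal to 2 because each key was seen twice, while the distinct-count estimate is only approximate at such a small k.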

class compact_tuple_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe, seed: int = 9001) _datasketches.compact_tuple_sketch

Reads a bytes object and returns the corresponding compact_tuple_sketch

Non-static Methods:

__init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) None
__init__(self, other: _datasketches.theta_sketch, summary: object) None

Overloaded function.

  1. __init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) -> None

Creates a compact_tuple_sketch from an existing tuple_sketch.

  • other (tuple_sketch) – a source tuple_sketch

  • ordered (bool, optional) – whether the incoming sketch entries are sorted. Default True

  2. __init__(self, other: _datasketches.theta_sketch, summary: object) -> None

Creates a compact_tuple_sketch from a theta_sketch using a fixed summary value.

  • other (theta_sketch) – a source theta sketch

  • summary (object) – a summary to use for every sketch entry


Produces a compact_tuple_sketch from the given sketch by applying a predicate to the summary in each entry.


predicate – A function returning True or False, evaluated on each tuple summary


A compact_tuple_sketch with the selected entries

Return type: compact_tuple_sketch



Serializes the sketch into a bytes object

class tuple_union(*args, **kwargs)
__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) None

Creates a tuple_union using the provided parameters

  • policy (TuplePolicy) – a policy to use when unioning entries

  • lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.


Returns the sketch corresponding to the union result


Resets the union to the initial empty state


Updates the union with the given sketch
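
Conceptually, a union keeps every entry that falls below the smaller of the two theta values and uses the policy to combine summaries for keys present in both inputs. The sketch below models this on dictionaries keyed by hash fraction. It is a simplification under stated assumptions: toy_union is a hypothetical helper, the hash values are made up, and the trimming back to k entries that a real union performs is omitted.

```python
def toy_union(a: dict, theta_a: float, b: dict, theta_b: float, merge):
    # The result can only be as finely sampled as the coarser input.
    theta = min(theta_a, theta_b)
    result = {}
    for entries in (a, b):
        for h, summary in entries.items():
            if h >= theta:
                continue  # outside the common sampled region
            # Merge summaries when the key is already present, else copy it.
            result[h] = merge(result[h], summary) if h in result else summary
    return result, theta


a = {0.1: 2, 0.3: 5}          # hash fraction -> summary, theta 0.5
b = {0.1: 7, 0.2: 1, 0.6: 4}  # hash fraction -> summary, theta 0.8

result, theta = toy_union(a, 0.5, b, 0.8, merge=max)
print(result, theta)          # {0.1: 7, 0.3: 5, 0.2: 1} 0.5
print(len(result) / theta)    # 6.0, estimated size of the union
```

Passing max as the merge function mimics a policy that keeps the larger summary for keys seen on both sides.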

class tuple_intersection(*args, **kwargs)
__init__(self, policy: _datasketches.TuplePolicy, seed: int = 9001) None

Creates a tuple_intersection using the provided parameters

  • policy (TuplePolicy) – a policy to use when intersecting entries

  • seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds


Returns the sketch corresponding to the intersection result


Returns True if the intersection has a valid result, otherwise False


Intersects the provided sketch with the current intersection state
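
An intersection keeps only the keys present in every input, below the smallest theta, and uses the policy to combine their summaries. The sketch below models this for two inputs on dictionaries keyed by hash fraction. The helper toy_intersection and the hash values are hypothetical, and this is a conceptual model rather than the library's implementation.

```python
def toy_intersection(a: dict, theta_a: float, b: dict, theta_b: float, combine):
    # Only hashes below both thresholds were sampled by both inputs.
    theta = min(theta_a, theta_b)
    result = {
        h: combine(a[h], b[h])
        for h in a.keys() & b.keys()  # keys retained by both inputs
        if h < theta
    }
    return result, theta


a = {0.1: 2, 0.3: 5}          # hash fraction -> summary, theta 0.5
b = {0.1: 7, 0.2: 1, 0.6: 4}  # hash fraction -> summary, theta 0.8

result, theta = toy_intersection(a, 0.5, b, 0.8, combine=min)
print(result, theta)          # {0.1: 2} 0.5
print(len(result) / theta)    # 2.0, estimated size of the intersection
```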

class tuple_a_not_b(*args, **kwargs)
__init__(self, seed: int = 9001) None

Creates a tuple_a_not_b object


seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.


Returns a sketch with the result of applying the A-not-B operation on the given inputs
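
The A-not-B operation keeps the entries of A whose keys do not appear in B, restricted to the region sampled by both inputs. Since every surviving summary comes from A alone, nothing has to be combined, which is consistent with the constructor above taking no policy. The sketch below models this on dictionaries keyed by hash fraction; toy_a_not_b and the hash values are hypothetical.

```python
def toy_a_not_b(a: dict, theta_a: float, b: dict, theta_b: float):
    # Restrict to the region sampled by both inputs.
    theta = min(theta_a, theta_b)
    # Keep A's entries, with A's summaries, when the key is absent from B.
    result = {h: s for h, s in a.items() if h < theta and h not in b}
    return result, theta


a = {0.1: 2, 0.3: 5}          # hash fraction -> summary, theta 0.5
b = {0.1: 7, 0.2: 1, 0.6: 4}  # hash fraction -> summary, theta 0.8

result, theta = toy_a_not_b(a, 0.5, b, 0.8)
print(result, theta)          # {0.3: 5} 0.5
print(len(result) / theta)    # 2.0, estimated number of keys in A but not in B
```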