Tuple Sketch

Tuple sketches extend Theta sketches: they provide estimates of distinct counts and additionally keep an arbitrary summary associated with each retained key (for example, a count for every key). Using a tuple_sketch requires a TuplePolicy, which defines how summaries are created, updated, merged, or intersected. The library provides a few basic TuplePolicy implementations, but a well-chosen custom summary and policy can make quite complex analyses straightforward.
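To make the idea of "a summary per retained key" concrete, here is a minimal pure-Python toy model. It is not the library's implementation: ToyTupleSketch, CountPolicy, and the create_summary/update_summary method names are hypothetical and chosen only for illustration. The model keeps the k entries with the smallest hash values, lets a policy object maintain the summaries, and estimates the distinct count as the number of retained entries divided by theta.

```python
import hashlib


class CountPolicy:
    """Hypothetical policy: the summary is an integer count per key."""

    def create_summary(self):
        return 0

    def update_summary(self, summary, update):
        return summary + update


class ToyTupleSketch:
    """Toy model only: keeps the k entries with the smallest hashes."""

    def __init__(self, policy, k=64):
        self.policy = policy
        self.k = k
        self.theta = 1.0   # effective sampling rate, starts at 1.0
        self.entries = {}  # hash value in [0, 1) -> summary

    @staticmethod
    def _hash(key):
        # Map the key to a pseudo-random fraction in [0, 1).
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, key, value):
        h = self._hash(key)
        if h >= self.theta:
            return  # key falls outside the sampled region
        summary = self.entries.get(h)
        if summary is None:
            summary = self.policy.create_summary()
        self.entries[h] = self.policy.update_summary(summary, value)
        if len(self.entries) > self.k:
            # Lower theta to the largest retained hash and evict that entry.
            self.theta = max(self.entries)
            del self.entries[self.theta]

    def get_estimate(self):
        return len(self.entries) / self.theta


sketch = ToyTupleSketch(CountPolicy(), k=64)
for user in ["alice", "bob", "alice", "carol", "alice"]:
    sketch.update(user, 1)

print(sketch.get_estimate())            # 3.0 (exact: fewer than k keys seen)
print(sorted(sketch.entries.values()))  # [1, 1, 3]
```

Swapping CountPolicy for a policy that keeps, say, a maximum or a set of values changes what each retained key remembers without touching the sampling logic, which is the separation of concerns a policy is meant to provide.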

Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.

Several Jaccard similarity measures can be computed between tuple sketches with the tuple_jaccard_similarity class.

Note

Serializing and deserializing this sketch requires the use of a PyObjectSerDe.

class tuple_sketch

An abstract base class for tuple sketches.

DEFAULT_SEED = 9001

get_estimate

Estimate of the distinct count of the input stream

get_lower_bound

Returns an approximate lower bound on the estimate for a given number of standard deviations in {1, 2, 3}

get_seed_hash

Returns a hash of the seed used in the sketch

get_upper_bound

Returns an approximate upper bound on the estimate for a given number of standard deviations in {1, 2, 3}

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if sketch is in estimation mode, otherwise False

is_ordered

Returns True if the sketch entries are sorted, otherwise False

property num_retained

The number of items currently in the sketch

property theta

Theta (effective sampling rate) as a fraction from 0 to 1

property theta64

Theta as 64-bit value

to_string

Produces a string summary of the sketch
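The relationship between num_retained, theta, and the estimate can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: it assumes theta64 is scaled so that the largest signed 64-bit value corresponds to theta = 1.0, and the bounds use a simple normal approximation of binomial sampling rather than whatever method the library actually applies. All function names here are hypothetical.

```python
import math

# Assumed scale: this theta64 value would correspond to theta == 1.0.
MAX_THETA64 = 2**63 - 1


def theta_fraction(theta64):
    """Convert a 64-bit theta to a fraction between 0 and 1 (assumed scaling)."""
    return theta64 / MAX_THETA64


def estimate(num_retained, theta):
    """Distinct-count estimate: retained entries scaled up by the sampling rate."""
    return num_retained / theta


def rough_bounds(num_retained, theta, num_std_devs):
    """Rough bounds via a normal approximation; not the library's method."""
    if num_std_devs not in (1, 2, 3):
        raise ValueError("num_std_devs must be 1, 2, or 3")
    est = estimate(num_retained, theta)
    std_dev = math.sqrt(num_retained * (1.0 - theta)) / theta
    # The true count can never be smaller than the number of retained keys.
    lower = max(float(num_retained), est - num_std_devs * std_dev)
    upper = est + num_std_devs * std_dev
    return lower, upper


print(estimate(1000, 0.25))         # 4000.0
print(rough_bounds(1000, 0.25, 2))  # interval around 4000.0
print(rough_bounds(1000, 1.0, 2))   # (1000.0, 1000.0): exact mode, no uncertainty
```

When theta is 1.0 the sketch has seen every key it was given and the estimate is exact, which is why the bounds collapse onto the estimate in the last line.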

class update_tuple_sketch
__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) -> None

Creates an update_tuple_sketch using the provided parameters

Parameters:
  • policy (TuplePolicy) – a policy to use when updating

  • lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values

compact

Returns a compacted form of the sketch, optionally sorting it

reset

Resets the sketch to the initial empty state

trim

Removes retained entries in excess of the nominal size k (if any)

update

Overloaded function.

  1. update(self, datum: int, value: object) -> None

Updates the sketch with the given integral item and summary value

  2. update(self, datum: float, value: object) -> None

Updates the sketch with the given floating point item and summary value

  3. update(self, datum: str, value: object) -> None

Updates the sketch with the given string item and summary value

class compact_tuple_sketch

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe, seed: int = 9001) -> _datasketches.compact_tuple_sketch

Reads a bytes object and returns the corresponding compact_tuple_sketch

Non-static Methods:

__init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) -> None
__init__(self, other: _datasketches.theta_sketch, summary: object) -> None

Overloaded function.

  1. __init__(self, other: _datasketches.tuple_sketch, ordered: bool = True) -> None

Creates a compact_tuple_sketch from an existing tuple_sketch.

Parameters:
  • other (tuple_sketch) – a source tuple_sketch

  • ordered (bool, optional) – whether the incoming sketch entries are sorted. Default True

  2. __init__(self, other: _datasketches.theta_sketch, summary: object) -> None

Creates a compact_tuple_sketch from a theta_sketch using a fixed summary value.

Parameters:
  • other (theta_sketch) – a source theta sketch

  • summary (object) – a summary to use for every sketch entry

serialize

Serializes the sketch into a bytes object

class tuple_union
__init__(self, policy: _datasketches.TuplePolicy, lg_k: int = 12, p: float = 1.0, seed: int = 9001) -> None

Creates a tuple_union using the provided parameters

Parameters:
  • policy (TuplePolicy) – a policy to use when unioning entries

  • lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.

get_result

Returns the sketch corresponding to the union result

reset

Resets the union to the initial empty state

update

Updates the union with the given sketch

class tuple_intersection
__init__(self, policy: _datasketches.TuplePolicy, seed: int = 9001) -> None

Creates a tuple_intersection using the provided parameters

Parameters:
  • policy (TuplePolicy) – a policy to use when intersecting entries

  • seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds

get_result

Returns the sketch corresponding to the intersection result

has_result

Returns True if the intersection has a valid result, otherwise False

update

Intersects the provided sketch with the current intersection state

class tuple_a_not_b
__init__(self, seed: int = 9001) -> None

Creates a tuple_a_not_b object

Parameters:
  • seed (int, optional) – the seed to use when hashing values. Must match any sketch seeds.

compute

Returns a sketch with the result of applying the A-not-B operation on the given inputs
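The three set operations documented above (tuple_union, tuple_intersection, tuple_a_not_b) can be pictured with ordinary dictionaries that map keys to summaries. The sketch below is a simplified exact-mode model, not the library's API: it ignores hashing, theta, and seeds, and the toy_* function names and the merge callable are hypothetical stand-ins for the role a policy plays. It does reflect one detail visible in the signatures above: A-not-B takes no policy, because entries surviving from A keep their summaries unchanged.

```python
import operator


def toy_union(a, b, merge):
    """Keys from either input; summaries of shared keys are merged."""
    result = dict(a)
    for key, summary in b.items():
        result[key] = merge(result[key], summary) if key in result else summary
    return result


def toy_intersection(a, b, merge):
    """Only keys present in both inputs; their summaries are merged."""
    return {key: merge(a[key], b[key]) for key in a if key in b}


def toy_a_not_b(a, b):
    """Keys in A that are absent from B; summaries from A are kept as-is."""
    return {key: summary for key, summary in a.items() if key not in b}


# Per-user event counts on two days (illustrative data).
monday = {"alice": 2, "bob": 1}
tuesday = {"bob": 4, "carol": 3}

print(toy_union(monday, tuesday, operator.add))         # {'alice': 2, 'bob': 5, 'carol': 3}
print(toy_intersection(monday, tuesday, operator.add))  # {'bob': 5}
print(toy_a_not_b(monday, tuesday))                     # {'alice': 2}
```

In the real sketches the inputs are sampled, so the results are estimates, and every operand must have been built with the same seed for the keys to be comparable.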