Theta Sketch

Theta sketches are used for distinct counting.

The theta package contains the basic sketch classes that are members of the Theta Sketch Framework. There is a separate Tuple package for many of the sketches that are derived from the same algorithms defined in the Theta Sketch Framework paper.

The theta sketch is a space-efficient method for estimating the cardinality of a set. It also handles set operations (union, intersection, difference) while maintaining good accuracy. The theta sketch is a practical variant of the K-Minimum Values sketch that avoids the need to sort the stored hash values on every insertion. For set operations beyond the simple union, it has better error properties than the HyperLogLog sketch.
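To make the K-Minimum Values idea concrete, here is a minimal pure-Python illustration. It is a textbook KMV estimator, not the library's theta sketch algorithm, and the helper names hash01 and kmv_estimate are hypothetical. It keeps the k smallest hash values seen so far and estimates the distinct count from the k-th smallest one.

```python
import hashlib
import heapq


def hash01(item) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int = 1024) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    retained = set()  # hash values currently kept
    heap = []         # max-heap of the kept values, stored negated
    for item in items:
        h = hash01(item)
        if h in retained:
            continue  # duplicate of a kept value
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: exact count
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


# 200,000 updates, but only 50,000 distinct values.
stream = [i % 50_000 for i in range(200_000)]
estimate = kmv_estimate(stream)
print(f"estimated distinct count: {estimate:.0f}")
```

The k-th smallest hash value plays the role that theta plays in the theta sketch: it is the fraction of the hash space that is retained, so dividing the number of retained values by it scales the sample back up to the full set.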

Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.

Several Jaccard similarity measures can be computed between theta sketches with the theta_jaccard_similarity class.

class theta_sketch

An abstract base class for theta sketches

get_estimate

Estimate of the distinct count of the input stream

get_lower_bound

Returns an approximate lower bound on the estimate at the given number of standard deviations, which must be 1, 2, or 3

get_seed_hash

Returns a hash of the seed used in the sketch

get_upper_bound

Returns an approximate upper bound on the estimate at the given number of standard deviations, which must be 1, 2, or 3

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if sketch is in estimation mode, otherwise False

is_ordered

Returns True if the sketch entries are sorted, otherwise False

property num_retained

The number of items currently in the sketch

property theta

Theta (effective sampling rate) as a fraction from 0 to 1

property theta64

Theta as 64-bit value

to_string

Produces a string summary of the sketch

class update_theta_sketch
__init__(self, lg_k: int = 12, p: float = 1.0, seed: int = 9001) -> None

Creates an update_theta_sketch using the provided parameters

Parameters:
  • lg_k (int, optional) – base 2 logarithm of the maximum size of the sketch. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values

compact

Returns a compacted form of the sketch, optionally sorting it

reset

Resets the sketch to the initial empty state

trim

Removes retained entries in excess of the nominal size k (if any)

update

Overloaded function.

  1. update(self, datum: int) -> None

Updates the sketch with the given integral value

  2. update(self, datum: float) -> None

Updates the sketch with the given floating point value

  3. update(self, datum: str) -> None

Updates the sketch with the given string

class compact_theta_sketch

Static Methods:

deserialize(bytes: bytes, seed: int = 9001) -> _datasketches.compact_theta_sketch

Reads a bytes object and returns the corresponding compact_theta_sketch

Non-static Methods:

__init__(self, arg0: _datasketches.theta_sketch, arg1: bool, /) -> None

Creates a compact_theta_sketch from an existing theta_sketch.

Parameters:
  • other (theta_sketch) – a source theta_sketch

  • ordered (bool) – whether the incoming sketch entries are sorted. Default True

serialize

Serializes the sketch into a bytes object, optionally compressing the data

class theta_union
__init__(self, lg_k: int = 12, p: float = 1.0, seed: int = 9001) -> None

Creates a theta_union using the provided parameters

Parameters:
  • lg_k (int, optional) – base 2 logarithm of the maximum size of the union. Default 12.

  • p (float, optional) – an initial sampling rate to use. Default 1.0

  • seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds.

get_result

Returns the sketch corresponding to the union result

update

Updates the union with the given sketch

class theta_intersection
__init__(self, seed: int = 9001) -> None

Creates a theta_intersection using the provided parameters

Parameters:

  • seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds.

get_result

Returns the sketch corresponding to the intersection result

has_result

Returns True if the intersection has a valid result, otherwise False

update

Intersects the provided sketch with the current intersection state

class theta_a_not_b
__init__(self, seed: int = 9001) -> None

Creates a theta_a_not_b object

Parameters:

  • seed (int, optional) – the seed to use when hashing values. Must match all sketch seeds.

compute

Returns a sketch with the result of applying the A-not-B operation on the given inputs