Variance Optimal Sampling (VarOpt)

A VarOpt sketch samples data from a stream of items. The sketch is designed for optimal (minimum) variance when estimating subset sums of items matching a provided predicate. The sketch produces a sample of size k (or smaller if fewer items have been presented), with the probability of including an item roughly corresponding to the item’s weight relative to the total weight of all items presented to the sketch.

VarOpt sampling is related to reservoir sampling, with improved error bounds for subset sum estimation. Feeding the sketch items with a uniform weight value will produce a sample equivalent to reservoir sampling.
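
For comparison, the uniform-weight case corresponds to classic reservoir sampling. The following is a minimal pure-Python sketch of that technique (Algorithm R), included only to illustrate the behavior described above. It is not the library's implementation, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Illustrative only: this is plain reservoir sampling (Algorithm R),
    not the VarOpt algorithm and not part of the datasketches package.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Item number n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

As with the sketch, the result holds k items once at least k have been seen, and fewer otherwise.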

Note

Serializing and deserializing this sketch requires the use of a PyObjectSerDe.

class var_opt_sketch

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.var_opt_sketch

Reads a bytes object and returns the corresponding var opt sketch

Non-static Methods:

__init__(self, k: int) None

Creates a new Var Opt sketch instance

Parameters:

k (int) – Maximum number of samples in the sketch

estimate_subset_sum

Applies a provided predicate to the sketch and returns the estimated total weight matching the predicate, as well as upper and lower bounds on the estimate and the total weight processed by the sketch
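
Conceptually, a subset-sum estimate over a weighted sample adds up the weights carried by the sampled items that satisfy the predicate. The plain-Python sketch below illustrates only that idea. The function name `subset_sum_from_samples` and the `(item, adjusted_weight)` pair layout are assumptions made for illustration, and the upper and lower bounds that the library method also returns are not computed here.

```python
def subset_sum_from_samples(samples, predicate):
    """Estimate the total weight of items matching a predicate.

    samples:   iterable of (item, adjusted_weight) pairs (assumed layout)
    predicate: callable taking an item and returning True or False

    Returns (estimate, total_weight), where total_weight is the sum of
    all adjusted weights in the sample.
    """
    estimate = 0.0
    total_weight = 0.0
    for item, adjusted_weight in samples:
        total_weight += adjusted_weight
        if predicate(item):
            # Only matching items contribute to the subset-sum estimate.
            estimate += adjusted_weight
    return estimate, total_weight
```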

get_serialized_size_bytes

Computes the size in bytes needed to serialize the current sketch

is_empty

Returns True if the sketch is empty, otherwise False

property k

Returns the sketch’s maximum configured sample size

property n

Returns the total stream length

property num_samples

Returns the number of samples currently in the sketch

serialize

Serializes the sketch into a bytes object

to_string

Produces a string summary of the sketch and optionally prints the items

update

Updates the sketch with the given value and weight
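
A typical workflow with the members listed above might look like the following sketch. The helpers `fill_sketch` and `describe` are hypothetical and accept any object exposing the documented `update`, `is_empty`, `k`, `n` and `num_samples` members, so the code does not import the library itself. Passing the weight as the second positional argument to `update` is an assumption based on the description above.

```python
def fill_sketch(sketch, weighted_items):
    """Feed (item, weight) pairs into a sketch-like object and return it."""
    for item, weight in weighted_items:
        # Assumed call shape: update(item, weight).
        sketch.update(item, weight)
    return sketch


def describe(sketch):
    """Collect the documented summary members of a sketch into a dict."""
    return {
        "empty": sketch.is_empty(),          # method
        "k": sketch.k,                       # property: maximum sample size
        "n": sketch.n,                       # property: total stream length
        "num_samples": sketch.num_samples,   # property: samples currently held
    }
```

With the package installed, a call such as `describe(fill_sketch(var_opt_sketch(32), [("a", 1.0), ("b", 2.5)]))` would be the intended use, assuming `var_opt_sketch` is importable from the `datasketches` package.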

class var_opt_union

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.var_opt_union

Constructs a var opt union from the given bytes using the provided serde

Non-static Methods:

__init__(self, max_k: int) None

get_result

Returns a sketch corresponding to the union result

get_serialized_size_bytes

Computes the size in bytes needed to serialize the current union

reset

Resets the union to the empty state

serialize

Serializes the union into a bytes object with the provided serde

to_string

Produces a string summary of the union

update

Updates the union with the given sketch
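
Combining several sketches follows the pattern implied by the methods above: update one union with each sketch, then ask it for the result. The helper below is a hypothetical sketch of that pattern. It works with any object exposing the documented `update` and `get_result` methods and does not import the library itself.

```python
def union_sketches(union, sketches):
    """Merge sketches through a union-like object and return the result.

    union:    object with update(sketch) and get_result()
    sketches: iterable of sketches to merge
    """
    for sketch in sketches:
        union.update(sketch)
    # get_result() is documented to return a sketch for the union result.
    return union.get_result()
```

With the package installed, `union_sketches(var_opt_union(64), [sketch_a, sketch_b])` would be the intended use, assuming `var_opt_union` is importable from the `datasketches` package and `sketch_a` and `sketch_b` are existing sketches.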