Exact and Bounded, Probability Proportional to Size (EBPPS) Sampling

An EBPPS sketch produces a random sample of data from a stream of items, ensuring that the probability of including an item is always exactly proportional to the item’s size. The size of an item is defined as its weight relative to the total weight of all items seen so far by the sketch. In contrast to VarOpt sampling, this sketch may return fewer than k items in order to keep the probability of including an item strictly proportional to its size.
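As a rough illustration of the definitions above, the following pure-Python sketch computes each item's size and the inclusion probabilities that an exact PPS sample would target. It does not use the library. The function name `target_inclusion_probabilities` is hypothetical, and the formula for the expected sample size, min(k, W / w_max), is an assumption drawn from a reading of the paper this sketch is based on, not something stated on this page.

```python
def target_inclusion_probabilities(weights, k):
    """Return (sizes, expected_sample_size, inclusion_probabilities).

    Assumption (not taken from this page): the expected sample size is
    min(k, W / w_max), the largest value that keeps every inclusion
    probability at or below 1 while staying proportional to size.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")
    total = sum(weights)
    # Size of an item: its weight relative to the total weight seen so far.
    sizes = [w / total for w in weights]
    expected_size = min(float(k), total / max(weights))
    probabilities = [expected_size * s for s in sizes]
    return sizes, expected_size, probabilities


# One heavy item pushes the expected sample size below k = 2.
sizes, expected_size, probabilities = target_inclusion_probabilities(
    [1.0, 1.0, 10.0], k=2
)
print(sizes)
print(expected_size)
print(probabilities)
```

With one dominant weight, the heavy item's inclusion probability reaches 1 before the sample can grow to k, which is the situation in which the sketch returns fewer than k items.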

This sketch is based on: B. Hentschel, P. J. Haas, and Y. Tian, “Exact PPS Sampling with Bounded Sample Size”, Information Processing Letters, 2023.

EBPPS sampling is related to reservoir sampling, but handles unequal item weights. Feeding the sketch items with a uniform weight value will produce a sample equivalent to reservoir sampling.
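For comparison, uniform-weight reservoir sampling can be sketched in a few lines of plain Python. This is the textbook Algorithm R, shown only to make the comparison concrete. It is not how the library implements EBPPS, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Classic reservoir sampling (Algorithm R): a uniform sample of up to k items."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; the new item is kept only if the
            # slot falls inside the reservoir, i.e. with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(1000), k=5, rng=random.Random(42))
print(sample)
```

Every item in the stream ends up in the result with the same probability, which is the behavior the EBPPS sketch is described as matching when all weights are equal.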

Note

Serializing and deserializing this sketch requires the use of a PyObjectSerDe.

class ebpps_sketch

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) → _datasketches.ebpps_sketch

Reads a bytes object and returns the corresponding ebpps_sketch

Non-static Methods:

__init__(self, k: int) → None

Creates a new EBPPS sketch instance

Parameters:

k (int) – Maximum number of samples in the sketch

property c

The expected number of samples returned upon a call to get_result() or the creation of an iterator. The number is a floating point value, where the fractional portion represents the probability of including a “partial item” from the sample. The value of c should be no larger than the sketch’s configured value of k, although limited numerical precision means it may exceed k by a double-precision rounding error in certain cases.
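One way to read a fractional value of c is sketched below in plain Python: the sample holds floor(c) items, plus one more with probability equal to the fractional part. The helper `draw_sample_size` is hypothetical and only illustrates how to interpret the number; it makes no claim about the library's internals.

```python
import math
import random


def draw_sample_size(c, rng=None):
    """Draw a sample size with expectation c: floor(c), plus 1 with probability frac(c)."""
    if c < 0:
        raise ValueError("c must be non-negative")
    rng = rng or random.Random()
    whole = math.floor(c)
    fraction = c - whole
    # The "partial item" is included with probability equal to the fractional part.
    return whole + (1 if rng.random() < fraction else 0)


rng = random.Random(7)
draws = [draw_sample_size(2.25, rng) for _ in range(10_000)]
print(min(draws), max(draws))
print(sum(draws) / len(draws))
```

Averaged over many draws, the realized size approaches c, which is why the property is described as an expected count and not as a fixed one.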

get_serialized_size_bytes

Computes the size in bytes needed to serialize the current sketch

is_empty

Returns True if the sketch is empty, otherwise False

property k

The sketch’s maximum configured sample size

merge

Merges the sketch with the given sketch

property n

The total stream length

serialize

Serializes the sketch into a bytes object

to_string

Produces a string summary of the sketch and optionally prints the items

update

Updates the sketch with the given value and weight