Exact and Bounded, Probability Proportional to Size (EBPPS) Sampling
An EBPPS sketch produces a random sample of data from a stream of items, ensuring that the probability of including an item is always exactly proportional to the item’s size. The size of an item is defined as its weight relative to the total weight of all items seen so far by the sketch. In contrast to VarOpt sampling, this sketch may return fewer than k items in order to keep the probability of including an item strictly proportional to its size.
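To make the relationship between item size, k, and the number of returned items concrete, the following is a minimal sketch in plain Python. It does not use the library. The helper name `inclusion_probabilities` is hypothetical, and the rule that the expected sample size is `min(k, total_weight / max_weight)` is our reading of the cited paper, not a documented property of this class.

```python
def inclusion_probabilities(weights, k):
    """Return (c, probs) for a list of positive weights and a maximum sample size k.

    Assumption (not taken from the library docs): the expected sample size c
    is capped so that no single item's inclusion probability exceeds 1.
    """
    if k < 1 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need k >= 1 and a non-empty list of positive weights")
    total = sum(weights)
    heaviest = max(weights)
    # Expected sample size: at most k, and small enough that the heaviest
    # item is included with probability at most 1.
    c = min(k, total / heaviest)
    # Each item's probability is c times its size (weight / total weight).
    probs = [c * w / total for w in weights]
    return c, probs


c, probs = inclusion_probabilities([1.0, 1.0, 2.0, 4.0], k=3)
print(c, probs)
```

With these weights the heaviest item accounts for half of the total weight, so under the stated assumption the expected sample size drops to 2 even though k is 3. This is the situation in which the sketch returns fewer than k items.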
This sketch is based on: B. Hentschel, P. J. Haas, Y. Tian, “Exact PPS Sampling with Bounded Sample Size”, Information Processing Letters, 2023.
EBPPS sampling is related to reservoir sampling, but handles unequal item weights. Feeding the sketch items with a uniform weight value will produce a sample equivalent to reservoir sampling.
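For comparison, classic reservoir sampling (often called Algorithm R) can be sketched as below. This is a standalone illustration of the equal-weight baseline mentioned above, not the implementation used by the sketch. The function name `reservoir_sample` and its `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(42))
print(sample)
```

Every item ends up in the sample with the same probability, k / n, which is the behavior the text describes for uniform weights.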
Note
Serializing and deserializing this sketch requires the use of a PyObjectSerDe.
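A serialization round trip might look like the following. This is a hedged sketch, not verified against a specific release: it assumes the `datasketches` package is installed, that `PyStringsSerDe` is a suitable PyObjectSerDe for string items, and that `serialize` takes the serde as its argument. The wrapper function `round_trip_demo` is hypothetical and returns None when the package is unavailable.

```python
def round_trip_demo():
    """Serialize and restore a small sketch of string items.

    Returns a dict of basic properties, or None if datasketches
    (or its ebpps_sketch class) is not available.
    """
    try:
        from datasketches import ebpps_sketch, PyStringsSerDe
    except ImportError:
        return None

    sk = ebpps_sketch(5)  # k = 5
    for i in range(10):
        # update(value, weight): later items get larger weights
        sk.update(f"item-{i}", 1.0 + i)

    serde = PyStringsSerDe()  # assumed serde for str items
    data = sk.serialize(serde)  # assumed to accept the serde
    restored = ebpps_sketch.deserialize(data, serde)
    return {"k": restored.k, "n": restored.n, "c": restored.c}


summary = round_trip_demo()
print(summary)
```

The same serde instance is used on both sides, since the bytes only describe how items were encoded, not their Python type.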
- class ebpps_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.ebpps_sketch
Reads a bytes object and returns the corresponding ebpps_sketch
Non-static Methods:
- __init__(self, k: int) None
Creates a new EBPPS sketch instance
- Parameters:
k (int) – Maximum number of samples in the sketch
- property c
The expected number of samples returned upon a call to get_result() or the creation of an iterator. The number is a floating point value, where the fractional portion represents the probability of including a “partial item” from the sample. The value of c should be no larger than the sketch’s configured value of k, although floating point rounding means it may exceed k by a margin on the order of double precision error in certain cases.
- get_serialized_size_bytes
Computes the size in bytes needed to serialize the current sketch
- is_empty
Returns True if the sketch is empty, otherwise False
- property k
The sketch’s maximum configured sample size
- merge
Merges the sketch with the given sketch
- property n
The total stream length
- serialize
Serializes the sketch into a bytes object
- to_string
Produces a string summary of the sketch and optionally prints the items
- update
Updates the sketch with the given value and weight
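The fractional value described for property c above can be read as follows: the sample contains floor(c) items, plus one more with probability equal to the fractional part. The snippet below is a plain-Python illustration of that interpretation only. The function `realized_sample_size` is hypothetical and not part of the library.

```python
import math
import random


def realized_sample_size(c, rng):
    """Draw one sample size whose expectation is c."""
    whole = math.floor(c)
    frac = c - whole
    # The "partial item" is included with probability frac.
    return whole + (1 if rng.random() < frac else 0)


rng = random.Random(7)
sizes = [realized_sample_size(2.25, rng) for _ in range(20000)]
mean_size = sum(sizes) / len(sizes)
print(sorted(set(sizes)), mean_size)
```

Individual draws are always 2 or 3 here, while the average over many draws approaches 2.25.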