Relative Error Quantiles (REQ) Sketch
This is an implementation based on the paper “Relative Error Streaming Quantiles” by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý, and loosely derived from a Python prototype written by Pavel Veselý.
This implementation differs from the algorithm described in the paper in the following:
The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
The size of each section (variable k and section_size in the code and parameter k in the paper) is initialized with a number set by the user via variable k. When the number of sections doubles, we decrease section_size by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
The merge operation here does not perform “special compactions”, which are used in the paper to allow for a tight mathematical analysis of the sketch. This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.
The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction. The Python prototype only implemented a comparison criterion of INCLUSIVE. This implementation allows the user to use both the INCLUSIVE criterion and the EXCLUSIVE criterion. This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
Note
For the req_items_sketch
, objects must be comparable with __lt__
.
Note
Serializing and deserializing a req_items_sketch
requires the use of a PyObjectSerDe
.
- class req_ints_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.req_ints_sketch
Deserializes the sketch from a bytes object.
- get_RSE(k: int, rank: float, is_hra: bool, n: int) float
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in http://arxiv.org/abs/2004.01668v2, but the constant factors have been modified based on empirical measurements, for a given value of parameter k. Normalized rank must be a value between 0.0 and 1.0 (inclusive). If is_hra is True, uses high rank accuracy mode, else low rank accuracy. N is an estimate of the total number of points provided to the sketch.
Non-static Methods:
- __init__(self, k: int = 12, is_hra: bool = True) None
Creates an REQ sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 12.
is_hra (bool, optional) – Specifies whether the skech has High Rank Accuracy (True) or Low Rank Accuracy. Default True.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For req_floats_sketch: if the sketch is empty this returns nan. For req_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- get_rank_lower_bound
Returns an approximate lower bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- get_rank_upper_bound
Returns an approximate upper bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- is_hra
Returns True if the sketch is in High Rank Accuracy mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, item: int) -> None
Updates the sketch with the given value
update(self, array: ndarray[dtype=int32]) -> None
Updates the sketch with the values in the given array
- class req_floats_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.req_floats_sketch
Deserializes the sketch from a bytes object.
- get_RSE(k: int, rank: float, is_hra: bool, n: int) float
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in http://arxiv.org/abs/2004.01668v2, but the constant factors have been modified based on empirical measurements, for a given value of parameter k. Normalized rank must be a value between 0.0 and 1.0 (inclusive). If is_hra is True, uses high rank accuracy mode, else low rank accuracy. N is an estimate of the total number of points provided to the sketch.
Non-static Methods:
- __init__(self, k: int = 12, is_hra: bool = True) None
Creates an REQ sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 12.
is_hra (bool, optional) – Specifies whether the skech has High Rank Accuracy (True) or Low Rank Accuracy. Default True.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For req_floats_sketch: if the sketch is empty this returns nan. For req_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- get_rank_lower_bound
Returns an approximate lower bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- get_rank_upper_bound
Returns an approximate upper bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- is_hra
Returns True if the sketch is in High Rank Accuracy mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, item: float) -> None
Updates the sketch with the given value
update(self, array: ndarray[dtype=float32]) -> None
Updates the sketch with the values in the given array
- class req_items_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.req_items_sketch
Deserializes the sketch from a bytes object using the provided serde.
- get_RSE(k: int, rank: float, is_hra: bool, n: int) float
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in http://arxiv.org/abs/2004.01668v2, but the constant factors have been modified based on empirical measurements, for a given value of parameter k. Normalized rank must be a value between 0.0 and 1.0 (inclusive). If is_hra is True, uses high rank accuracy mode, else low rank accuracy. N is an estimate of the total number of points provided to the sketch.
Non-static Methods:
- __init__(self, k: int = 12, is_hra: bool = True) None
Creates an REQ sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 12.
is_hra (bool, optional) – Specifies whether the skech has High Rank Accuracy (True) or Low Rank Accuracy. Default True.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, req_floats_sketch returns nan; req_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For req_floats_sketch: if the sketch is empty this returns nan. For req_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- get_rank_lower_bound
Returns an approximate lower bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- get_rank_upper_bound
Returns an approximate upper bound on the given normalized rank. Normalized rank must be a value between 0.0 and 1.0 (inclusive); the number of standard deviations must be 1, 2, or 3.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- is_hra
Returns True if the sketch is in High Rank Accuracy mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object using the provided serde.
- to_string
Produces a string summary of the sketch
- update
Updates the sketch with the given value