Quantiles Sketch (Deprecated)
This is a deprecated quantiles sketch that is included for cross-language compatibility. Most new projects will favor the KLL sketch over this one, or the REQ sketch for higher accuracy at the very edge of a distribution.
This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass. The analysis is obtained using get_rank() and get_quantile() functions, the Probability Mass Function from get_pmf()` and the Cumulative Distribution Function from get_cdf.
Consider a large stream of one million values such as packet sizes coming into a network node. The natural rank of any specific size value is its index in the hypothetical sorted array of values. The normalized rank is the natural rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or get_quantile(0.5). Similarly, the 95th percentile is obtained from get_quantile(0.95).
From the min and max values, for example, 1 and 1000 bytes, you can obtain the PMF from get_pmf(100, 500, 900) that will result in an array of 4 fractional values such as {.4, .3, .2, .1}, which means that 40% of the values were < 100, 30% of the values were ≥ 100 and < 500, 20% of the values were ≥ 500 and < 900, and 10% of the values were ≥ 900. A frequency histogram can be obtained by multiplying these fractions by get_n(), which is the total count of values received. The get_cdf()` works similarly, but produces the cumulative distribution instead.
As of November 2021, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = double (64-bit double precision values).
The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized, rank error of about 1.7%. For example, the median item returned from get_quantile(0.5) will be between the actual items from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.
Note
For the quantiles_items_sketch
, objects must be comparable with __lt__
.
Note
Serializing and deserializing a quantiles_items_sketch
requires the use of a PyObjectSerDe
.
- class quantiles_ints_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.quantiles_ints_sketch
Deserializes the sketch from a bytes object.
- get_normalized_rank_error(k: int, as_pmf: bool) float
Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
Non-static Methods:
- __init__(self, k: int = 128) None
Creates a classic quantiles sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- normalized_rank_error
Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, item: int) -> None
Updates the sketch with the given value
update(self, array: ndarray[dtype=int32]) -> None
Updates the sketch with the values in the given array
- class quantiles_floats_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.quantiles_floats_sketch
Deserializes the sketch from a bytes object.
- get_normalized_rank_error(k: int, as_pmf: bool) float
Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
Non-static Methods:
- __init__(self, k: int = 128) None
Creates a classic quantiles sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- normalized_rank_error
Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, item: float) -> None
Updates the sketch with the given value
update(self, array: ndarray[dtype=float32]) -> None
Updates the sketch with the values in the given array
- class quantiles_doubles_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.quantiles_doubles_sketch
Deserializes the sketch from a bytes object.
- get_normalized_rank_error(k: int, as_pmf: bool) float
Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
Non-static Methods:
- __init__(self, k: int = 128) None
Creates a classic quantiles sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- normalized_rank_error
Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- update
Overloaded function.
update(self, item: float) -> None
Updates the sketch with the given value
update(self, array: ndarray[dtype=float64]) -> None
Updates the sketch with the values in the given array
- class quantiles_items_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.quantiles_items_sketch
Deserializes the sketch from a bytes object using the provided serde.
- get_normalized_rank_error(k: int, as_pmf: bool) float
Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
Non-static Methods:
- __init__(self, k: int = 128) None
Creates a classic quantiles sketch instance with the given value of k.
- Parameters:
k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.
- get_cdf
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_max_value
Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_min_value
Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError
- get_pmf
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- get_quantile
Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.
- get_quantiles
This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.
- get_rank
Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.
- is_empty
Returns True if the sketch is empty, otherwise False
- is_estimation_mode
Returns True if the sketch is in estimation mode, otherwise False
- property k
The configured parameter k
- merge
Merges the provided sketch into this one
- property n
The length of the input stream
- normalized_rank_error
Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials
- property num_retained
The number of retained items (samples) in the sketch
- serialize
Serializes the sketch into a bytes object using the provided serde.
- to_string
Produces a string summary of the sketch
- update
Updates the sketch with the given value