Quantiles Sketch (Deprecated)

This is a deprecated quantiles sketch that is included for cross-language compatibility. Most new projects will favor the KLL sketch over this one, or the REQ sketch for higher accuracy at the very edge of a distribution.

This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass. The analysis is obtained using get_rank() and get_quantile() functions, the Probability Mass Function from get_pmf()` and the Cumulative Distribution Function from get_cdf.

Consider a large stream of one million values such as packet sizes coming into a network node. The natural rank of any specific size value is its index in the hypothetical sorted array of values. The normalized rank is the natural rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or get_quantile(0.5). Similarly, the 95th percentile is obtained from get_quantile(0.95).

From the min and max values, for example, 1 and 1000 bytes, you can obtain the PMF from get_pmf(100, 500, 900) that will result in an array of 4 fractional values such as {.4, .3, .2, .1}, which means that 40% of the values were < 100, 30% of the values were ≥ 100 and < 500, 20% of the values were ≥ 500 and < 900, and 10% of the values were ≥ 900. A frequency histogram can be obtained by multiplying these fractions by get_n(), which is the total count of values received. The get_cdf()` works similarly, but produces the cumulative distribution instead.

As of November 2021, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = double (64-bit double precision values).

The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized, rank error of about 1.7%. For example, the median item returned from get_quantile(0.5) will be between the actual items from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.

Note

For the quantiles_items_sketch, objects must be comparable with __lt__.

Note

Serializing and deserializing a quantiles_items_sketch requires the use of a PyObjectSerDe.

class quantiles_ints_sketch

Static Methods:

deserialize(bytes: bytes) _datasketches.quantiles_ints_sketch

Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) float

Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 128) None

Creates a classic quantiles sketch instance with the given value of k.

Parameters:

k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.

get_cdf

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_max_value

Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_min_value

Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_pmf

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_quantile

Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles

This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank

Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if the sketch is in estimation mode, otherwise False

property k

The configured parameter k

merge

Merges the provided sketch into this one

property n

The length of the input stream

normalized_rank_error

Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained

The number of retained items (samples) in the sketch

serialize

Serializes the sketch into a bytes object.

to_string

Produces a string summary of the sketch

update

Overloaded function.

  1. update(self, item: int) -> None

Updates the sketch with the given value

  1. update(self, array: ndarray[dtype=int32]) -> None

Updates the sketch with the values in the given array

class quantiles_floats_sketch

Static Methods:

deserialize(bytes: bytes) _datasketches.quantiles_floats_sketch

Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) float

Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 128) None

Creates a classic quantiles sketch instance with the given value of k.

Parameters:

k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.

get_cdf

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_max_value

Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_min_value

Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_pmf

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_quantile

Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles

This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank

Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if the sketch is in estimation mode, otherwise False

property k

The configured parameter k

merge

Merges the provided sketch into this one

property n

The length of the input stream

normalized_rank_error

Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained

The number of retained items (samples) in the sketch

serialize

Serializes the sketch into a bytes object.

to_string

Produces a string summary of the sketch

update

Overloaded function.

  1. update(self, item: float) -> None

Updates the sketch with the given value

  1. update(self, array: ndarray[dtype=float32]) -> None

Updates the sketch with the values in the given array

class quantiles_doubles_sketch

Static Methods:

deserialize(bytes: bytes) _datasketches.quantiles_doubles_sketch

Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) float

Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 128) None

Creates a classic quantiles sketch instance with the given value of k.

Parameters:

k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.

get_cdf

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_max_value

Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_min_value

Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_pmf

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_quantile

Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles

This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank

Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if the sketch is in estimation mode, otherwise False

property k

The configured parameter k

merge

Merges the provided sketch into this one

property n

The length of the input stream

normalized_rank_error

Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained

The number of retained items (samples) in the sketch

serialize

Serializes the sketch into a bytes object.

to_string

Produces a string summary of the sketch

update

Overloaded function.

  1. update(self, item: float) -> None

Updates the sketch with the given value

  1. update(self, array: ndarray[dtype=float64]) -> None

Updates the sketch with the values in the given array

class quantiles_items_sketch

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.quantiles_items_sketch

Deserializes the sketch from a bytes object using the provided serde.

get_normalized_rank_error(k: int, as_pmf: bool) float

Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 128) None

Creates a classic quantiles sketch instance with the given value of k.

Parameters:

k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 128.

get_cdf

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_max_value

Returns the maximum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_min_value

Returns the minimum value from the stream. If empty, quantiles_floats_sketch returns nan; quantiles_ints_sketch throws a RuntimeError

get_pmf

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.

get_quantile

Returns an approximation to the data value associated with the given rank in a hypothetical sorted version of the input stream so far. For quantiles_floats_sketch: if the sketch is empty this returns nan. For quantiles_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles

This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank

Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty

Returns True if the sketch is empty, otherwise False

is_estimation_mode

Returns True if the sketch is in estimation mode, otherwise False

property k

The configured parameter k

merge

Merges the provided sketch into this one

property n

The length of the input stream

normalized_rank_error

Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained

The number of retained items (samples) in the sketch

serialize

Serializes the sketch into a bytes object using the provided serde.

to_string

Produces a string summary of the sketch

update

Updates the sketch with the given value