KLL Sketch

Implementation of a very compact quantiles sketch with lazy compaction scheme and nearly optimal accuracy per retained item. See Optimal Quantile Approximation in Streams.

This is a stochastic streaming sketch that enables near real-time analysis of the approximate distribution of items from a very large stream in a single pass, requiring only that the items are comparable. The analysis is obtained using get_quantile() function or the inverse functions get_rank(), get_pmf() (Probability Mass Function), and get_cdf() (Cumulative Distribution Function).

As of May 2020, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = float (32-bit single precision values).

Given an input stream of N items, the natural rank of any specific item is defined as its index (1 to N) in inclusive mode or (0 to N-1) in exclusive mode in the hypothetical sorted stream of all N input items.

The normalized rank (rank) of any specific item is defined as its natural rank divided by N. Thus, the normalized rank is between zero and one. In the documentation for this sketch natural rank is never used so any reference to just rank should be interpreted to mean normalized rank.

This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.

The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding items.

The relationship between the normalized rank and the corresponding items can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding items on the other axis. If the y-axis is specified as the item-axis and the x-axis as the normalized rank, then y = get_quantile(x) is a monotonically increasing function.

The function get_quantile(rank) translates ranks into corresponding quantiles. The functions get_rank(item), get_cdf(…) (Cumulative Distribution Function), and get_pmf(…) (Probability Mass Function) perform the opposite operation and translate items into ranks.

The get_pmf(…) function has about 13 to 47% worse rank error (depending on k) than the other queries because the mass of each “bin” of the PMF has “double-sided” error from the upper and lower edges of the bin as a result of a subtraction, as the errors from the two edges can sometimes add.

The default k of 200 yields a “single-sided” epsilon of about 1.33% and a “double-sided” (PMF) epsilon of about 1.65%.

A get_quantile(rank) query has the following guarantees: - Let q = get_quantile(r) where r is the rank between zero and one. - The quantile q will be an item from the input stream. - Let true_rank be the true rank of q derived from the hypothetical sorted stream of all N items. - Let eps = get_normalized_rank_error(false). - Then r - eps ≤ true_rank ≤ r + eps with a confidence of 99%. Note that the error is on the rank, not the quantile.

A get_rank(item) query has the following guarantees: - Let r = get_rank(i) where i is an item between the min and max items of the input stream. - Let true_rank be the true rank of i derived from the hypothetical sorted stream of all N items. - Let eps = get_normalized_rank_error(false). - Then r - eps ≤ true_rank ≤ r + eps with a confidence of 99%.

A get_pmf() query has the following guarantees: - Let {r1, r2, …, r(m+1)} = get_pmf(s1, s2, …, sm) where s1, s2 are split points (items from the input domain) between the min and max items of the input stream. - Let mass_i = estimated mass between s_i and s_i+1. - Let true_mass be the true mass between the items of s_i, s_i+1 derived from the hypothetical sorted stream of all N items. - Let eps = get_normalized_rank_error(true). - then mass - eps ≤ true_mass ≤ mass + eps with a confidence of 99%. - r(m+1) includes the mass of all points larger than s_m.

A get_cdf(…) query has the following guarantees; - Let {r1, r2, …, r(m+1)} = get_cdf(s1, s2, …, sm) where s1, s2, … are split points (items from the input domain) between the min and max items of the input stream. - Let mass_i = r_(i+1) - r_i. - Let true_mass be the true mass between the true ranks of s_i, s_i+1 derived from the hypothetical sorted stream of all N items. - Let eps = get_normalized_rank_error(true). - then mass - eps ≤ true_mass ≤ mass + eps with a confidence of 99%. - 1 - r(m+1) includes the mass of all points larger than s_m.

From the above, it might seem like we could make some estimates to bound the item returned from a call to get_quantile(). The sketch, however, does not let us derive error bounds or confidences around items. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions. - Let q = get_quantile(r), the estimated quantile of rank r. - Let eps = get_normalized_rank_error(false). - Let q_lo = estimated quantile of rank (r - eps). - Let q_hi = estimated quantile of rank (r + eps). - Then q_lo ≤ q ≤ q_hi, with 99% confidence.

Note

For the kll_items_sketch, objects must be comparable with __lt__.

Note

Serializing and deserializing a kll_items_sketch requires the use of a PyObjectSerDe.

class kll_ints_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) → _datasketches.kll_ints_sketch: Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) → float: Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 200) → None

Creates a KLL sketch instance with the given value of k.

Parameters:: k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 200.

get_cdf: Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_max_value: Returns the maximum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_min_value: Returns the minimum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_pmf: Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_quantile: Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For kll_floats_sketch: if the sketch is empty this returns nan. For kll_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles: This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank: Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty: Returns True if the sketch is empty, otherwise False

is_estimation_mode: Returns True if the sketch is in estimation mode, otherwise False

property k: The configured parameter k

merge: Merges the provided sketch into this one

property n: The length of the input stream

normalized_rank_error: Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained: The number of retained items (samples) in the sketch

serialize: Serializes the sketch into a bytes object.

to_string: Produces a string summary of the sketch

update

Overloaded function.

update(self, item: int) -> None

Updates the sketch with the given value

update(self, array: ndarray[dtype=int32]) -> None

Updates the sketch with the values in the given array

class kll_floats_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) → _datasketches.kll_floats_sketch: Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) → float: Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 200) → None

Creates a KLL sketch instance with the given value of k.

Parameters:: k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 200.

get_cdf: Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_max_value: Returns the maximum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_min_value: Returns the minimum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_pmf: Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_quantile: Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For kll_floats_sketch: if the sketch is empty this returns nan. For kll_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles: This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank: Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty: Returns True if the sketch is empty, otherwise False

is_estimation_mode: Returns True if the sketch is in estimation mode, otherwise False

property k: The configured parameter k

merge: Merges the provided sketch into this one

property n: The length of the input stream

normalized_rank_error: Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained: The number of retained items (samples) in the sketch

serialize: Serializes the sketch into a bytes object.

to_string: Produces a string summary of the sketch

update

Overloaded function.

update(self, item: float) -> None

Updates the sketch with the given value

update(self, array: ndarray[dtype=float32]) -> None

Updates the sketch with the values in the given array

class kll_doubles_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) → _datasketches.kll_doubles_sketch: Deserializes the sketch from a bytes object.

get_normalized_rank_error(k: int, as_pmf: bool) → float: Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 200) → None

Creates a KLL sketch instance with the given value of k.

Parameters:: k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 200.

get_cdf: Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_max_value: Returns the maximum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_min_value: Returns the minimum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_pmf: Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_quantile: Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For kll_floats_sketch: if the sketch is empty this returns nan. For kll_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles: This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank: Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty: Returns True if the sketch is empty, otherwise False

is_estimation_mode: Returns True if the sketch is in estimation mode, otherwise False

property k: The configured parameter k

merge: Merges the provided sketch into this one

property n: The length of the input stream

normalized_rank_error: Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained: The number of retained items (samples) in the sketch

serialize: Serializes the sketch into a bytes object.

to_string: Produces a string summary of the sketch

update

Overloaded function.

update(self, item: float) -> None

Updates the sketch with the given value

update(self, array: ndarray[dtype=float64]) -> None

Updates the sketch with the values in the given array

class kll_items_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) → _datasketches.kll_items_sketch: Deserializes the sketch from a bytes object using the provided serde.

get_normalized_rank_error(k: int, as_pmf: bool) → float: Gets the normalized rank error given parameters k and the pmf flag. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

Non-static Methods:

__init__(self, k: int = 200) → None

Creates a KLL sketch instance with the given value of k.

Parameters:: k (int, optional) – Controls the size/accuracy trade-off of the sketch. Default is 200.

get_cdf: Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_max_value: Returns the maximum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_min_value: Returns the minimum value from the stream. If empty, kll_floats_sketch returns nan; kll_ints_sketch throws a RuntimeError

get_pmf: Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(True) function. If the sketch is empty this returns an empty vector. split_points is an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. If the parameter inclusive=false, the definition of an ‘interval’ is inclusive of the left split point (or minimum value) and exclusive of the right split point, with the exception that the last interval will include the maximum value. If the parameter inclusive=true, the definition of an ‘interval’ is exclusive of the left split point (or minimum value) and inclusive of the right split point. It is not necessary to include either the min or max values in these split points.

get_quantile: Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input stream so far. For kll_floats_sketch: if the sketch is empty this returns nan. For kll_ints_sketch: if the sketch is empty this throws a RuntimeError.

get_quantiles: This returns an array that could have been generated by using get_quantile() for each normalized rank separately. If the sketch is empty this returns an empty vector.

get_rank: Returns an approximation to the normalized rank of the given value from 0 to 1, inclusive. The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(False) function. With the parameter inclusive=true the weight of the given value is included into the rank.Otherwise the rank equals the sum of the weights of values less than the given value. If the sketch is empty this returns nan.

is_empty: Returns True if the sketch is empty, otherwise False

is_estimation_mode: Returns True if the sketch is in estimation mode, otherwise False

property k: The configured parameter k

merge: Merges the provided sketch into this one

property n: The length of the input stream

normalized_rank_error: Gets the normalized rank error for this sketch. If pmf is True, returns the ‘double-sided’ normalized rank error for the get_PMF() function. Otherwise, it is the ‘single-sided’ normalized rank error for all the other queries. Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials

property num_retained: The number of retained items (samples) in the sketch

serialize: Serializes the sketch into a bytes object using the provided serde.

to_string: Produces a string summary of the sketch

update: Updates the sketch with the given value