All Known Subinterfaces:: QuantilesDoublesAPI, QuantilesFloatsAPI, QuantilesGenericAPI<T>, QuantilesLongsAPI

All Known Implementing Classes:: CompactDoublesSketch, DoublesSketch, ItemsSketch, KllDoublesSketch, KllFloatsSketch, KllItemsSketch, KllLongsSketch, KllSketch, ReqSketch, UpdateDoublesSketch

public interface QuantilesAPI

This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution of items from a very large stream in a single pass, requiring only that the items are comparable. The analysis is obtained using the getQuantile() function or the inverse functions getRank(), getPMF() (the Probability Mass Function), and getCDF() (the Cumulative Distribution Function).

Given an input stream of N items, the natural rank of any specific item is defined as its index (1 to N) in the hypothetical sorted stream of all N input items.

The normalized rank (rank) of any specific item is defined as its natural rank divided by N, which is a number in the interval [0.0, 1.0]. In the Javadocs for all the quantile sketches natural rank is seldom used so any reference to just rank should be interpreted as normalized rank.

Inputs into a quantile sketch are called "items" that can be either generic or specific primitives, like float or double depending on the sketch implementation. In order to keep its size small, sketches don't retain all the items offered and retain only a small fraction of all the items, thus purging most of the items. The items retained are then sorted and associated with a rank. At this point we call the retained items quantiles. Thus, all quantiles are items, but only a few items become quantiles. Depending on the context the two terms can be interchangeable.

All quantile sketches are configured with a parameter k, which affects the size of the sketch and its estimation error.

In the research literature, the estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger sizes of k result in a smaller epsilon, but also a larger sketch. The epsilon error is always with respect to the rank domain. Estimating the confidence interval in the quantile domain can be done by first computing the error in the rank domain and then translating that to the quantile domain. The sketch provides methods to assist with that.

The relationship between the normalized rank and the corresponding quantiles can be viewed as a two dimensional monotonic plot with the normalized rank on one axis and the corresponding quantiles on the other axis. Let q := quantile and r := rank then both q = getQuantile(r) and r = getRank(q) are monotonically increasing functions. If the y-axis is used for the rank domain and the x-axis for the quantile domain, then y = getRank(x) is also the single point Cumulative Distribution Function (CDF).

The functions getQuantile() translate ranks into corresponding quantiles. The functions getRank(), getCDF(), and getPMF() (Probability Mass Function) perform the opposite operation and translate quantiles into ranks (or cumulative probabilities, or probability masses, depending on the context).

As an example, consider a large stream of one million items such as packet sizes coming into a network node. The absolute rank of any specific item size is simply its index in the hypothetical sorted array of such items. The normalized rank is the natural rank divided by the stream size, or N, in this case one million. The quantile corresponding to the normalized rank of 0.5 represents the 50th percentile or median of the distribution, obtained from getQuantile(0.5). Similarly, the 95th percentile is obtained from getQuantile(0.95).

From the min and max quantiles, for example, say 1 and 1000 bytes, you can obtain the PMF from getPMF(100, 500, 900) that will result in an array of 4 probability masses such as {.4, .3, .2, .1}, which means that

40% of the mass was < 100,
30% of the mass was ≥ 100 and < 500,
20% of the mass was ≥ 500 and < 900, and
10% of the mass was ≥ 900.

A frequency histogram can be obtained by simply multiplying these probability masses by getN(), which is the total count of items received. The getCDF() works similarly, but produces the cumulative distribution instead.

The accuracy of this sketch is a function of the configured k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank.

The getPMF() function has about 13 to 47% worse rank error (depending on k) than the other queries because the mass of each "bin" of the PMF has "double-sided" error from the upper and lower edges of the bin as a result of a subtraction of random variables where the errors from the two edges can sometimes add.

A getQuantile(rank) query has the following probabilistic guarantees:

Let q = getQuantile(r) where r is the rank between zero and one.
The quantile q will be a quantile from the input stream.
Let trueRank be the true rank of q derived from the hypothetical sorted stream of all N quantiles.
Let eps = getNormalizedRankError(false)[*].
Then r - eps ≤ trueRank ≤ r + eps. Note that the error is on the rank, not the quantile.

A getRank(quantile) query has the following probabilistic guarantees:

Let r = getRank(q) where q is a quantile between the min and max quantiles of the input stream.
Let trueRank be the true rank of q derived from the hypothetical sorted stream of all N quantiles.
Let eps = getNormalizedRankError(false)[*].
Then r - eps ≤ trueRank ≤ r + eps.

A getPMF() query has the following probabilistic guarantees:

Let {r₁, r₂, ..., r_m+1} = getPMF(v₁, v₂, ..., v_m) where q₁, q₂, ..., q_m are monotonically increasing quantiles supplied by the user that are part of the monotonic sequence q₀ = min, q₁, q₂, ..., q_m, q_m+1 = max, and where min and max are the actual minimum and maximum quantiles of the input stream automatically included in the sequence by the getPMF(...) function.
Let r_i = mass_i = estimated mass between v_i-1 and q_i where q₀ = min and q_m+1 = max.
Let trueMass be the true mass between the quantiles of q_i, q_i+1 derived from the hypothetical sorted stream of all N quantiles.
Let eps = getNormalizedRankError(true)[*].
Then mass - eps ≤ trueMass ≤ mass + eps.
r₁ includes the mass of all points between min = q₀ and q₁.
r_m+1 includes the mass of all points between q_m and max = q_m+1.

A getCDF(...) query has the following probabilistic guarantees:

Let {r₁, r₂, ..., r_m+1} = getCDF(q₁, q₂, ..., q_m) where q₁, q₂, ..., q_m) are monotonically increasing quantiles supplied by the user that are part of the monotonic sequence {q₀ = min, q₁, q₂, ..., q_m, q_m+1 = max}, and where min and max are the actual minimum and maximum quantiles of the input stream automatically included in the sequence by the getCDF(...) function.
Let r_i = mass_i = estimated mass between q₀ = min and q_i.
Let trueMass be the true mass between the true ranks of q_i, q_i+1 derived from the hypothetical sorted stream of all N quantiles.
Let eps = getNormalizedRankError(true)[*].
then mass - eps ≤ trueMass ≤ mass + eps.
r₁ includes the mass of all points between min = q₀ and q₁.
r_m+1 includes the mass of all points between min = q₀ and max = q_m+1.

Because errors are independent, we can make some estimates of the size of the confidence bounds for the quantile returned from a call to getQuantile(), but not error bounds. These confidence bounds may be quite large for certain distributions.

Let q = getQuantile(r), the estimated quantile of rank r.
Let eps = getNormalizedRankError(false)[*].
Let q_lo = estimated quantile of rank (r - eps).
Let q_hi = estimated quantile of rank (r + eps).
Then q_lo ≤ q ≤ q_hi.

This sketch is order and distribution insensitive

This algorithm intentionally inserts randomness into the sampling process for items that ultimately get retained in the sketch. Thus, the results produced by this algorithm are not deterministic. For example, if the same stream is inserted into two different instances of this sketch, the answers obtained from the two sketches should be close, but may not be be identical.

Similarly, there may be directional inconsistencies. For example, if a quantile obtained from getQuantile(rank) is input into the reverse query getRank(quantile), the resulting rank should be close, but may not exactly equal the original rank.

Please visit our website: DataSketches Home Page and specific Javadocs for more information.

[*] Note that obtaining epsilon may require using a similar function but with more parameters based on the specific sketch implementation.

Author:

Lee Rhodes, Kevin Lang, Alexander Saydakov

See Also:

Field Summary

Fields

Modifier and Type

Field

Description

static final String

EMPTY_MSG

The sketch must not be empty for this operation.

static final String

MEM_REQ_SVR_NULL_MSG

MemoryRequestServer must not be null.

static final String

NOT_SINGLE_ITEM_MSG

Sketch does not have just one item.

static final String

SELF_MERGE_MSG

A sketch cannot merge with itself.

static final String

TGT_IS_READ_ONLY_MSG

Target sketch is Read Only, cannot write.

static final String

UNSUPPORTED_MSG

Unsupported operation for this Sketch Type.
Method Summary

Modifier and Type

Method

Description

int

getK()

Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.

long

getN()

Gets the length of the input stream offered to the sketch..

double

getNormalizedRankError(boolean pmf)

Gets the approximate rank error of this sketch normalized as a fraction between zero and one.

int

getNumRetained()

Gets the number of quantiles retained by the sketch.

double

getRankLowerBound(double rank)

Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists.

double

getRankUpperBound(double rank)

Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists.

boolean

hasMemory()

Returns true if this sketch's data structure is backed by Memory or WritableMemory.

boolean

isDirect()

Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).

boolean

isEmpty()

Returns true if this sketch is empty.

boolean

isEstimationMode()

Returns true if this sketch is in estimation mode.

boolean

isReadOnly()

Returns true if this sketch is read only.

void

reset()

Resets this sketch to the empty state.

String

toString()

Returns a summary of the key parameters of the sketch.

Field Details
- EMPTY_MSG
  
  static final String EMPTY_MSG
  
  The sketch must not be empty for this operation.
  See Also:
  
  Constant Field Values
- UNSUPPORTED_MSG
  
  static final String UNSUPPORTED_MSG
  
  Unsupported operation for this Sketch Type.
  See Also:
  
  Constant Field Values
- NOT_SINGLE_ITEM_MSG
  
  static final String NOT_SINGLE_ITEM_MSG
  
  Sketch does not have just one item.
  See Also:
  
  Constant Field Values
- MEM_REQ_SVR_NULL_MSG
  
  static final String MEM_REQ_SVR_NULL_MSG
  
  MemoryRequestServer must not be null.
  See Also:
  
  Constant Field Values
- TGT_IS_READ_ONLY_MSG
  
  static final String TGT_IS_READ_ONLY_MSG
  
  Target sketch is Read Only, cannot write.
  See Also:
  
  Constant Field Values
- SELF_MERGE_MSG
  
  static final String SELF_MERGE_MSG
  
  A sketch cannot merge with itself.
  See Also:
  
  Constant Field Values
Method Details
- getK
  
  int getK()
  
  Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
  
  Returns:
  
  the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
- getN
  
  long getN()
  
  Gets the length of the input stream offered to the sketch..
  
  Returns:
  
  the length of the input stream offered to the sketch.
- getNormalizedRankError
  
  double getNormalizedRankError(boolean pmf)
  
  Gets the approximate rank error of this sketch normalized as a fraction between zero and one. The epsilon returned is a best fit to 99 percent confidence empirically measured max error in thousands of trials.
  
  Parameters:
  
  pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
  
  Returns:
  
  if pmf is true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
- getNumRetained
  
  int getNumRetained()
  
  Gets the number of quantiles retained by the sketch.
  
  Returns:
  
  the number of quantiles retained by the sketch
- getRankLowerBound
  
  double getRankLowerBound(double rank)
  
  Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists.
  
  Parameters:
  
  rank - the given normalized rank.
  
  Returns:
  
  the lower bound of the rank confidence interval in which the true rank of the given rank exists.
- getRankUpperBound
  
  double getRankUpperBound(double rank)
  
  Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists.
  
  Parameters:
  
  rank - the given normalized rank.
  
  Returns:
  
  the upper bound of the rank confidence interval in which the true rank of the given rank exists.
- hasMemory
  
  boolean hasMemory()
  
  Returns true if this sketch's data structure is backed by Memory or WritableMemory.
  
  Returns:
  
  true if this sketch's data structure is backed by Memory or WritableMemory.
- isDirect
  
  boolean isDirect()
  
  Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
  
  Returns:
  
  true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
- isEmpty
  
  boolean isEmpty()
  
  Returns true if this sketch is empty.
  
  Returns:
  
  true if this sketch is empty.
- isEstimationMode
  
  boolean isEstimationMode()
  
  Returns true if this sketch is in estimation mode.
  
  Returns:
  
  true if this sketch is in estimation mode.
- isReadOnly
  
  boolean isReadOnly()
  
  Returns true if this sketch is read only.
  
  Returns:
  
  true if this sketch is read only.
- reset
  
  void reset()
  
  Resets this sketch to the empty state. If the sketch is read only this does nothing.
  The parameter k will not change.
- toString
  
  String toString()
  
  Returns a summary of the key parameters of the sketch.
  
  Overrides:
  
  toString in class Object
  
  Returns:
  
  a summary of the key parameters of the sketch.

Interface QuantilesAPI

Field Summary

Method Summary

Field Details

EMPTY_MSG

UNSUPPORTED_MSG

NOT_SINGLE_ITEM_MSG

MEM_REQ_SVR_NULL_MSG

TGT_IS_READ_ONLY_MSG

SELF_MERGE_MSG

Method Details

getK

getN

getNormalizedRankError

getNumRetained

getRankLowerBound

getRankUpperBound

hasMemory

isDirect

isEmpty

isEstimationMode

isReadOnly

reset

toString