Class ReqSketch

java.lang.Object
org.apache.datasketches.req.ReqSketch
All Implemented Interfaces:
QuantilesAPI, QuantilesFloatsAPI

public final class ReqSketch extends Object
This Relative Error Quantiles Sketch is the Java implementation based on the paper "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý, and loosely derived from a Python prototype written by Pavel Veselý.

Reference: https://arxiv.org/abs/2004.01668

This implementation differs from the algorithm described in the paper in the following:

  • The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
  • The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a number set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
  • The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.

This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.

  • The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.
  • The Python prototype only implemented a comparison criterion of "INCLUSIVE". This implementation allows the user to switch back and forth between the "INCLUSIVE" criterion and the "EXCLUSIVE" criterion.
  • This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
Author:
Edo Liberty, Pavel Vesely, Lee Rhodes
See Also:
  • Field Summary

    Fields inherited from interface org.apache.datasketches.quantilescommon.QuantilesAPI

    EMPTY_MSG, MEM_REQ_SVR_NULL_MSG, NOT_SINGLE_ITEM_MSG, SELF_MERGE_MSG, TGT_IS_READ_ONLY_MSG, UNSUPPORTED_MSG
  • Method Summary

    Modifier and Type
    Method
    Description
    static final ReqSketchBuilder
    Returns a new ReqSketchBuilder
    double[]
    getCDF(float[] splitPoints, QuantileSearchCriteria searchCrit)
    Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.
    boolean
    If true, the high ranks are prioritized for better accuracy.
    int
    Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
    float
    Returns the maximum item of the stream.
    float
    Returns the minimum item of the stream.
    long
    Gets the length of the input stream offered to the sketch..
    double
    Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
    int
    Gets the number of quantiles retained by the sketch.
    double[]
    getPMF(float[] splitPoints, QuantileSearchCriteria searchCrit)
    Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.
    float
    getQuantile(double normRank, QuantileSearchCriteria searchCrit)
    Gets the approximate quantile of the given normalized rank and the given search criterion.
    float
    getQuantileLowerBound(double rank)
    Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.
    float
    getQuantileLowerBound(double rank, int numStdDev)
    Gets an approximate lower bound of the quantile associated with the given rank.
    float[]
    getQuantiles(double[] normRanks, QuantileSearchCriteria searchCrit)
    Gets an array of quantiles from the given array of normalized ranks.
    float
    getQuantileUpperBound(double rank)
    Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.
    float
    getQuantileUpperBound(double rank, int numStdDev)
    Gets an approximate upper bound of the quantile associated with the given rank.
    double
    getRank(float quantile, QuantileSearchCriteria searchCrit)
    Gets the normalized rank corresponding to the given a quantile.
    double
    getRankLowerBound(double rank)
    Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists.
    double
    getRankLowerBound(double rank, int numStdDev)
    Gets an approximate lower bound rank of the given normalized rank.
    double[]
    getRanks(float[] quantiles, QuantileSearchCriteria searchCrit)
    Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.
    double
    getRankUpperBound(double rank)
    Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists.
    double
    getRankUpperBound(double rank, int numStdDev)
    Gets an approximate upper bound rank of the given rank.
    static double
    getRSE(int k, double rank, boolean hra, long totalN)
    Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
    int
    Returns the current number of bytes this Sketch would require if serialized.
    Gets the sorted view of this sketch
    boolean
    Returns true if this sketch's data structure is backed by Memory or WritableMemory.
    static ReqSketch
    heapify(org.apache.datasketches.memory.Memory mem)
    Returns an ReqSketch on the heap from a Memory image of the sketch.
    boolean
    Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
    boolean
    Returns true if this sketch is empty.
    boolean
    Returns true if this sketch is in estimation mode.
    boolean
    Returns true if this sketch is read only.
    Gets the iterator for this sketch, which is not sorted.
    Merge other sketch into this one.
    void
    Resets this sketch to the empty state.
    byte[]
    Returns a byte array representation of this sketch.
    Returns a summary of the key parameters of the sketch.
    void
    update(float item)
    Updates this sketch with the given item.
    viewCompactorDetail(String fmt, boolean allData)
    A detailed, human readable view of the sketch compactors and their data.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

    Methods inherited from interface org.apache.datasketches.quantilescommon.QuantilesFloatsAPI

    getCDF, getPMF, getQuantile, getQuantiles, getRank, getRanks
  • Method Details

    • builder

      public static final ReqSketchBuilder builder()
      Returns a new ReqSketchBuilder
      Returns:
      a new ReqSketchBuilder
    • heapify

      public static ReqSketch heapify(org.apache.datasketches.memory.Memory mem)
      Returns an ReqSketch on the heap from a Memory image of the sketch.
      Parameters:
      mem - The Memory object holding a valid image of an ReqSketch
      Returns:
      an ReqSketch on the heap from a Memory image of the sketch.
    • getK

      public int getK()
      Description copied from interface: QuantilesAPI
      Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
      Specified by:
      getK in interface QuantilesAPI
      Returns:
      the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
    • getCDF

      public double[] getCDF(float[] splitPoints, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      Specified by:
      getCDF in interface QuantilesFloatsAPI
      Parameters:
      splitPoints - an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 overlapping intervals.

      The start of each interval is below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and the end of the interval is the rank or cumulative probability corresponding to the split point.

      The (m+1)th interval represents 100% of the distribution represented by the sketch and consistent with the definition of a cumulative probability distribution, thus the (m+1)th rank or probability in the returned array is always 1.0.

      If a split point exactly equals a retained item of the sketch and the search criterion is:

      • INCLUSIVE, the resulting cumulative probability will include that item.
      • EXCLUSIVE, the resulting cumulative probability will not include the weight of that split point.

      It is not recommended to include either the minimum or maximum items of the input stream.

      searchCrit - the desired search criteria.
      Returns:
      a discrete CDF array of m+1 double ranks (or cumulative probabilities) on the interval [0.0, 1.0].
    • getHighRankAccuracyMode

      public boolean getHighRankAccuracyMode()
      If true, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. This state is chosen during sketch construction.
      Returns:
      the high ranks accuracy state.
    • getMaxItem

      public float getMaxItem()
      Description copied from interface: QuantilesFloatsAPI
      Returns the maximum item of the stream. This is provided for convenience, but may be different from the largest item retained by the sketch algorithm.
      Specified by:
      getMaxItem in interface QuantilesFloatsAPI
      Returns:
      the maximum item of the stream
    • getMinItem

      public float getMinItem()
      Description copied from interface: QuantilesFloatsAPI
      Returns the minimum item of the stream. This is provided for convenience, but is distinct from the smallest item retained by the sketch algorithm.
      Specified by:
      getMinItem in interface QuantilesFloatsAPI
      Returns:
      the minimum item of the stream
    • getN

      public long getN()
      Description copied from interface: QuantilesAPI
      Gets the length of the input stream offered to the sketch..
      Specified by:
      getN in interface QuantilesAPI
      Returns:
      the length of the input stream offered to the sketch.
    • getNormalizedRankError

      public double getNormalizedRankError(boolean pmf)
      Description copied from interface: QuantilesAPI
      Gets the approximate rank error of this sketch normalized as a fraction between zero and one. The epsilon returned is a best fit to 99 percent confidence empirically measured max error in thousands of trials.
      Parameters:
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
    • getPMF

      public double[] getPMF(float[] splitPoints, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.

      Specified by:
      getPMF in interface QuantilesFloatsAPI
      Parameters:
      splitPoints - an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 consecutive, non-overlapping intervals.

      Each interval except for the end intervals starts with a split point and ends with the next split point in sequence.

      The first interval starts below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and ends with the first split point

      The last (m+1)th interval starts with the last split point and ends after the last item retained by the sketch corresponding to a rank or probability of 1.0.

      The sum of the probability masses of all (m+1) intervals is 1.0.

      If the search criterion is:

      • INCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will include that item. If the lower split point equals an item retained by the sketch, the interval will exclude that item.
      • EXCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will exclude that item. If the lower split point equals an item retained by the sketch, the interval will include that item.

      It is not recommended to include either the minimum or maximum items of the input stream.

      searchCrit - the desired search criteria.
      Returns:
      a PMF array of m+1 probability masses as doubles on the interval [0.0, 1.0].
    • getQuantile

      public float getQuantile(double normRank, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Gets the approximate quantile of the given normalized rank and the given search criterion.
      Specified by:
      getQuantile in interface QuantilesFloatsAPI
      Parameters:
      normRank - the given normalized rank, a double in the range [0.0, 1.0].
      searchCrit - If INCLUSIVE, the given rank includes all quantiles ≤ the quantile directly corresponding to the given rank. If EXCLUSIVE, he given rank includes all quantiles < the quantile directly corresponding to the given rank.
      Returns:
      the approximate quantile given the normalized rank.
      See Also:
    • getQuantiles

      public float[] getQuantiles(double[] normRanks, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Gets an array of quantiles from the given array of normalized ranks.
      Specified by:
      getQuantiles in interface QuantilesFloatsAPI
      Parameters:
      normRanks - the given array of normalized ranks, each of which must be in the interval [0.0,1.0].
      searchCrit - if INCLUSIVE, the given ranks include all quantiles ≤ the quantile directly corresponding to each rank.
      Returns:
      an array of quantiles corresponding to the given array of normalized ranks.
      See Also:
    • getQuantileLowerBound

      public float getQuantileLowerBound(double rank)
      Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.

      Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile confidence interval as an additive or multiplicative percent of the true quantile.

      The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.
      Specified by:
      getQuantileLowerBound in interface QuantilesFloatsAPI
      Parameters:
      rank - the given normalized rank
      Returns:
      the lower bound of the quantile confidence interval in which the quantile of the given rank exists.
    • getQuantileLowerBound

      public float getQuantileLowerBound(double rank, int numStdDev)
      Gets an approximate lower bound of the quantile associated with the given rank.
      Parameters:
      rank - the given normalized rank, a number between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate lower bound quantile, if it exists.
    • getQuantileUpperBound

      public float getQuantileUpperBound(double rank)
      Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.

      Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile interval as an additive or multiplicative percent of the true quantile.

      The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.
      Specified by:
      getQuantileUpperBound in interface QuantilesFloatsAPI
      Parameters:
      rank - the given normalized rank
      Returns:
      the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.
    • getQuantileUpperBound

      public float getQuantileUpperBound(double rank, int numStdDev)
      Gets an approximate upper bound of the quantile associated with the given rank.
      Parameters:
      rank - the given normalized rank, a number between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate upper bound quantile, if it exists.
    • getRank

      public double getRank(float quantile, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Gets the normalized rank corresponding to the given a quantile.
      Specified by:
      getRank in interface QuantilesFloatsAPI
      Parameters:
      quantile - the given quantile
      searchCrit - if INCLUSIVE the given quantile is included into the rank.
      Returns:
      the normalized rank corresponding to the given quantile.
      See Also:
    • getRankLowerBound

      public double getRankLowerBound(double rank)
      Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.
      Parameters:
      rank - the given normalized rank.
      Returns:
      the lower bound of the rank confidence interval in which the true rank of the given rank exists.
    • getRankLowerBound

      public double getRankLowerBound(double rank, int numStdDev)
      Gets an approximate lower bound rank of the given normalized rank.
      Parameters:
      rank - the given normalized rank, a number between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate lower bound rank.
    • getRanks

      public double[] getRanks(float[] quantiles, QuantileSearchCriteria searchCrit)
      Description copied from interface: QuantilesFloatsAPI
      Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.
      Specified by:
      getRanks in interface QuantilesFloatsAPI
      Parameters:
      quantiles - the given array of quantiles
      searchCrit - if INCLUSIVE, the given quantiles include the rank directly corresponding to each quantile.
      Returns:
      an array of normalized ranks corresponding to the given array of quantiles.
      See Also:
    • getRankUpperBound

      public double getRankUpperBound(double rank)
      Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.
      Parameters:
      rank - the given normalized rank.
      Returns:
      the upper bound of the rank confidence interval in which the true rank of the given rank exists.
    • getRankUpperBound

      public double getRankUpperBound(double rank, int numStdDev)
      Gets an approximate upper bound rank of the given rank.
      Parameters:
      rank - the given rank, a number between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate upper bound rank.
    • getNumRetained

      public int getNumRetained()
      Description copied from interface: QuantilesAPI
      Gets the number of quantiles retained by the sketch.
      Specified by:
      getNumRetained in interface QuantilesAPI
      Returns:
      the number of quantiles retained by the sketch
    • getSerializedSizeBytes

      public int getSerializedSizeBytes()
      Description copied from interface: QuantilesFloatsAPI
      Returns the current number of bytes this Sketch would require if serialized.
      Specified by:
      getSerializedSizeBytes in interface QuantilesFloatsAPI
      Returns:
      the number of bytes this sketch would require if serialized.
    • isEmpty

      public boolean isEmpty()
      Description copied from interface: QuantilesAPI
      Returns true if this sketch is empty.
      Specified by:
      isEmpty in interface QuantilesAPI
      Returns:
      true if this sketch is empty.
    • isEstimationMode

      public boolean isEstimationMode()
      Description copied from interface: QuantilesAPI
      Returns true if this sketch is in estimation mode.
      Specified by:
      isEstimationMode in interface QuantilesAPI
      Returns:
      true if this sketch is in estimation mode.
    • iterator

      public QuantilesFloatsSketchIterator iterator()
      Description copied from interface: QuantilesFloatsAPI
      Gets the iterator for this sketch, which is not sorted.
      Specified by:
      iterator in interface QuantilesFloatsAPI
      Returns:
      the iterator for this sketch
    • merge

      public ReqSketch merge(ReqSketch other)
      Merge other sketch into this one. The other sketch is not modified.
      Parameters:
      other - sketch to be merged into this one.
      Returns:
      this
    • reset

      public void reset()
      Resets this sketch to the empty state. If the sketch is read only this does nothing.

      The parameter k will not change.

      The parameters k, highRankAccuracy, and reqDebug will not change.

      Specified by:
      reset in interface QuantilesAPI
    • toByteArray

      public byte[] toByteArray()
      Description copied from interface: QuantilesFloatsAPI
      Returns a byte array representation of this sketch.
      Specified by:
      toByteArray in interface QuantilesFloatsAPI
      Returns:
      a byte array representation of this sketch.
    • toString

      public String toString()
      Description copied from interface: QuantilesAPI
      Returns a summary of the key parameters of the sketch.
      Specified by:
      toString in interface QuantilesAPI
      Returns:
      a summary of the key parameters of the sketch.
    • update

      public void update(float item)
      Description copied from interface: QuantilesFloatsAPI
      Updates this sketch with the given item.
      Specified by:
      update in interface QuantilesFloatsAPI
      Parameters:
      item - from a stream of quantiles. NaNs are ignored.
    • viewCompactorDetail

      public String viewCompactorDetail(String fmt, boolean allData)
      A detailed, human readable view of the sketch compactors and their data. Each compactor string is prepended by the compactor lgWeight, the current number of retained quantiles of the compactor and the current nominal capacity of the compactor.
      Parameters:
      fmt - the format string for the quantiles; example: "%4.0f".
      allData - all the retained quantiles for the sketch will be output by compactor level. Otherwise, just a summary will be output.
      Returns:
      a detailed view of the compactors and their data
    • getSortedView

      public FloatsSketchSortedView getSortedView()
      Description copied from interface: QuantilesFloatsAPI
      Gets the sorted view of this sketch
      Specified by:
      getSortedView in interface QuantilesFloatsAPI
      Returns:
      the sorted view of this sketch
    • getRSE

      public static double getRSE(int k, double rank, boolean hra, long totalN)
      Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were adjusted based on empirical measurements.
      Parameters:
      k - the given size of k
      rank - the given normalized rank, a number in [0,1].
      hra - if true High Rank Accuracy mode is being selected, otherwise, Low Rank Accuracy.
      totalN - an estimate of the total number of items submitted to the sketch.
      Returns:
      an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
    • hasMemory

      public boolean hasMemory()
      Description copied from interface: QuantilesAPI
      Returns true if this sketch's data structure is backed by Memory or WritableMemory.
      Specified by:
      hasMemory in interface QuantilesAPI
      Returns:
      true if this sketch's data structure is backed by Memory or WritableMemory.
    • isDirect

      public boolean isDirect()
      Description copied from interface: QuantilesAPI
      Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
      Specified by:
      isDirect in interface QuantilesAPI
      Returns:
      true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
    • isReadOnly

      public boolean isReadOnly()
      Description copied from interface: QuantilesAPI
      Returns true if this sketch is read only.
      Specified by:
      isReadOnly in interface QuantilesAPI
      Returns:
      true if this sketch is read only.