Class ReqSketch

  • All Implemented Interfaces:
    QuantilesAPI, QuantilesFloatsAPI

    public final class ReqSketch
    extends Object
    This Relative Error Quantiles Sketch is the Java implementation based on the paper "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel VeselĂ˝, and loosely derived from a Python prototype written by Pavel VeselĂ˝.

    Reference: https://arxiv.org/abs/2004.01668

    This implementation differs from the algorithm described in the paper in the following:

    • The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
    • The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a number set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
    • The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.

    This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.

    • The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.
    • The Python prototype only implemented a comparison criterion of "INCLUSIVE". This implementation allows the user to switch back and forth between the "INCLUSIVE" criterion and the "EXCLUSIVE" criterion.
    • This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
    Author:
    Edo Liberty, Pavel Vesely, Lee Rhodes
    See Also:
    QuantilesAPI
    • Method Detail

      • builder

        public static final ReqSketchBuilder builder()
        Returns a new ReqSketchBuilder
        Returns:
        a new ReqSketchBuilder
      • heapify

        public static ReqSketch heapify​(org.apache.datasketches.memory.Memory mem)
        Returns an ReqSketch on the heap from a Memory image of the sketch.
        Parameters:
        mem - The Memory object holding a valid image of an ReqSketch
        Returns:
        an ReqSketch on the heap from a Memory image of the sketch.
      • getK

        public int getK()
        Description copied from interface: QuantilesAPI
        Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
        Specified by:
        getK in interface QuantilesAPI
        Returns:
        the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
      • getCDF

        public double[] getCDF​(float[] splitPoints,
                               QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.

        The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

        Specified by:
        getCDF in interface QuantilesFloatsAPI
        Parameters:
        splitPoints - an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 overlapping intervals.

        The start of each interval is below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and the end of the interval is the rank or cumulative probability corresponding to the split point.

        The (m+1)th interval represents 100% of the distribution represented by the sketch and consistent with the definition of a cumulative probability distribution, thus the (m+1)th rank or probability in the returned array is always 1.0.

        If a split point exactly equals a retained item of the sketch and the search criterion is:

        • INCLUSIVE, the resulting cumulative probability will include that item.
        • EXCLUSIVE, the resulting cumulative probability will not include the weight of that split point.

        It is not recommended to include either the minimum or maximum items of the input stream.

        searchCrit - the desired search criteria.
        Returns:
        a discrete CDF array of m+1 double ranks (or cumulative probabilities) on the interval [0.0, 1.0].
      • getHighRankAccuracyMode

        public boolean getHighRankAccuracyMode()
        If true, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. This state is chosen during sketch construction.
        Returns:
        the high ranks accuracy state.
      • getMaxItem

        public float getMaxItem()
        Description copied from interface: QuantilesFloatsAPI
        Returns the maximum item of the stream. This is provided for convenience, but may be different from the largest item retained by the sketch algorithm.
        Specified by:
        getMaxItem in interface QuantilesFloatsAPI
        Returns:
        the maximum item of the stream
      • getMinItem

        public float getMinItem()
        Description copied from interface: QuantilesFloatsAPI
        Returns the minimum item of the stream. This is provided for convenience, but is distinct from the smallest item retained by the sketch algorithm.
        Specified by:
        getMinItem in interface QuantilesFloatsAPI
        Returns:
        the minimum item of the stream
      • getN

        public long getN()
        Description copied from interface: QuantilesAPI
        Gets the length of the input stream offered to the sketch..
        Specified by:
        getN in interface QuantilesAPI
        Returns:
        the length of the input stream offered to the sketch.
      • getNormalizedRankError

        public double getNormalizedRankError​(boolean pmf)
        Description copied from interface: QuantilesAPI
        Gets the approximate rank error of this sketch normalized as a fraction between zero and one. The epsilon returned is a best fit to 99 percent confidence empirically measured max error in thousands of trials.
        Parameters:
        pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
        Returns:
        if pmf is true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      • getPMF

        public double[] getPMF​(float[] splitPoints,
                               QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.

        The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.

        Specified by:
        getPMF in interface QuantilesFloatsAPI
        Parameters:
        splitPoints - an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 consecutive, non-overlapping intervals.

        Each interval except for the end intervals starts with a split point and ends with the next split point in sequence.

        The first interval starts below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and ends with the first split point

        The last (m+1)th interval starts with the last split point and ends after the last item retained by the sketch corresponding to a rank or probability of 1.0.

        The sum of the probability masses of all (m+1) intervals is 1.0.

        If the search criterion is:

        • INCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will include that item. If the lower split point equals an item retained by the sketch, the interval will exclude that item.
        • EXCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will exclude that item. If the lower split point equals an item retained by the sketch, the interval will include that item.

        It is not recommended to include either the minimum or maximum items of the input stream.

        searchCrit - the desired search criteria.
        Returns:
        a PMF array of m+1 probability masses as doubles on the interval [0.0, 1.0].
      • getQuantile

        public float getQuantile​(double normRank,
                                 QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Gets the approximate quantile of the given normalized rank and the given search criterion.
        Specified by:
        getQuantile in interface QuantilesFloatsAPI
        Parameters:
        normRank - the given normalized rank, a double in the range [0.0, 1.0].
        searchCrit - If INCLUSIVE, the given rank includes all quantiles ≤ the quantile directly corresponding to the given rank. If EXCLUSIVE, he given rank includes all quantiles < the quantile directly corresponding to the given rank.
        Returns:
        the approximate quantile given the normalized rank.
        See Also:
        QuantileSearchCriteria
      • getQuantiles

        public float[] getQuantiles​(double[] normRanks,
                                    QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Gets an array of quantiles from the given array of normalized ranks.
        Specified by:
        getQuantiles in interface QuantilesFloatsAPI
        Parameters:
        normRanks - the given array of normalized ranks, each of which must be in the interval [0.0,1.0].
        searchCrit - if INCLUSIVE, the given ranks include all quantiles ≤ the quantile directly corresponding to each rank.
        Returns:
        an array of quantiles corresponding to the given array of normalized ranks.
        See Also:
        QuantileSearchCriteria
      • getQuantileLowerBound

        public float getQuantileLowerBound​(double rank)
        Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.

        Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile confidence interval as an additive or multiplicative percent of the true quantile.

        The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.
        Specified by:
        getQuantileLowerBound in interface QuantilesFloatsAPI
        Parameters:
        rank - the given normalized rank
        Returns:
        the lower bound of the quantile confidence interval in which the quantile of the given rank exists.
      • getQuantileLowerBound

        public float getQuantileLowerBound​(double rank,
                                           int numStdDev)
      • getQuantileUpperBound

        public float getQuantileUpperBound​(double rank)
        Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.

        Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile interval as an additive or multiplicative percent of the true quantile.

        The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.
        Specified by:
        getQuantileUpperBound in interface QuantilesFloatsAPI
        Parameters:
        rank - the given normalized rank
        Returns:
        the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.
      • getQuantileUpperBound

        public float getQuantileUpperBound​(double rank,
                                           int numStdDev)
      • getRank

        public double getRank​(float quantile,
                              QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Gets the normalized rank corresponding to the given a quantile.
        Specified by:
        getRank in interface QuantilesFloatsAPI
        Parameters:
        quantile - the given quantile
        searchCrit - if INCLUSIVE the given quantile is included into the rank.
        Returns:
        the normalized rank corresponding to the given quantile.
        See Also:
        QuantileSearchCriteria
      • getRankLowerBound

        public double getRankLowerBound​(double rank)
        Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.
        Parameters:
        rank - the given normalized rank.
        Returns:
        the lower bound of the rank confidence interval in which the true rank of the given rank exists.
      • getRankLowerBound

        public double getRankLowerBound​(double rank,
                                        int numStdDev)
        Gets an approximate lower bound rank of the given normalized rank.
        Parameters:
        rank - the given rank, a number between 0 and 1.0.
        numStdDev - the number of standard deviations. Must be 1, 2, or 3.
        Returns:
        an approximate lower bound rank.
      • getRanks

        public double[] getRanks​(float[] quantiles,
                                 QuantileSearchCriteria searchCrit)
        Description copied from interface: QuantilesFloatsAPI
        Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.
        Specified by:
        getRanks in interface QuantilesFloatsAPI
        Parameters:
        quantiles - the given array of quantiles
        searchCrit - if INCLUSIVE, the given quantiles include the rank directly corresponding to each quantile.
        Returns:
        an array of normalized ranks corresponding to the given array of quantiles.
        See Also:
        QuantileSearchCriteria
      • getRankUpperBound

        public double getRankUpperBound​(double rank)
        Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.
        Parameters:
        rank - the given normalized rank.
        Returns:
        the upper bound of the rank confidence interval in which the true rank of the given rank exists.
      • getRankUpperBound

        public double getRankUpperBound​(double rank,
                                        int numStdDev)
        Gets an approximate upper bound rank of the given rank.
        Parameters:
        rank - the given rank, a number between 0 and 1.0.
        numStdDev - the number of standard deviations. Must be 1, 2, or 3.
        Returns:
        an approximate upper bound rank.
      • getNumRetained

        public int getNumRetained()
        Description copied from interface: QuantilesAPI
        Gets the number of quantiles retained by the sketch.
        Specified by:
        getNumRetained in interface QuantilesAPI
        Returns:
        the number of quantiles retained by the sketch
      • getSerializedSizeBytes

        public int getSerializedSizeBytes()
        Description copied from interface: QuantilesFloatsAPI
        Returns the current number of bytes this Sketch would require if serialized.
        Specified by:
        getSerializedSizeBytes in interface QuantilesFloatsAPI
        Returns:
        the number of bytes this sketch would require if serialized.
      • isEmpty

        public boolean isEmpty()
        Description copied from interface: QuantilesAPI
        Returns true if this sketch is empty.
        Specified by:
        isEmpty in interface QuantilesAPI
        Returns:
        true if this sketch is empty.
      • isEstimationMode

        public boolean isEstimationMode()
        Description copied from interface: QuantilesAPI
        Returns true if this sketch is in estimation mode.
        Specified by:
        isEstimationMode in interface QuantilesAPI
        Returns:
        true if this sketch is in estimation mode.
      • merge

        public ReqSketch merge​(ReqSketch other)
        Merge other sketch into this one. The other sketch is not modified.
        Parameters:
        other - sketch to be merged into this one.
        Returns:
        this
      • reset

        public void reset()
        Resets this sketch to the empty state. If the sketch is read only this does nothing.

        The parameter k will not change.

        The parameters k, highRankAccuracy, and reqDebug will not change.

        Specified by:
        reset in interface QuantilesAPI
      • toByteArray

        public byte[] toByteArray()
        Description copied from interface: QuantilesFloatsAPI
        Returns a byte array representation of this sketch.
        Specified by:
        toByteArray in interface QuantilesFloatsAPI
        Returns:
        a byte array representation of this sketch.
      • toString

        public String toString()
        Description copied from interface: QuantilesAPI
        Returns a summary of the key parameters of the sketch.
        Specified by:
        toString in interface QuantilesAPI
        Returns:
        a summary of the key parameters of the sketch.
      • update

        public void update​(float item)
        Description copied from interface: QuantilesFloatsAPI
        Updates this sketch with the given item.
        Specified by:
        update in interface QuantilesFloatsAPI
        Parameters:
        item - from a stream of quantiles. NaNs are ignored.
      • viewCompactorDetail

        public String viewCompactorDetail​(String fmt,
                                          boolean allData)
        A detailed, human readable view of the sketch compactors and their data. Each compactor string is prepended by the compactor lgWeight, the current number of retained quantiles of the compactor and the current nominal capacity of the compactor.
        Parameters:
        fmt - the format string for the quantiles; example: "%4.0f".
        allData - all the retained quantiles for the sketch will be output by compactor level. Otherwise, just a summary will be output.
        Returns:
        a detailed view of the compactors and their data
      • getRSE

        public static double getRSE​(int k,
                                    double rank,
                                    boolean hra,
                                    long totalN)
        Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were adjusted based on empirical measurements.
        Parameters:
        k - the given size of k
        rank - the given normalized rank, a number in [0,1].
        hra - if true High Rank Accuracy mode is being selected, otherwise, Low Rank Accuracy.
        totalN - an estimate of the total number of items submitted to the sketch.
        Returns:
        an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
      • hasMemory

        public boolean hasMemory()
        Description copied from interface: QuantilesAPI
        Returns true if this sketch's data structure is backed by Memory or WritableMemory.
        Specified by:
        hasMemory in interface QuantilesAPI
        Returns:
        true if this sketch's data structure is backed by Memory or WritableMemory.
      • isDirect

        public boolean isDirect()
        Description copied from interface: QuantilesAPI
        Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
        Specified by:
        isDirect in interface QuantilesAPI
        Returns:
        true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
      • isReadOnly

        public boolean isReadOnly()
        Description copied from interface: QuantilesAPI
        Returns true if this sketch is read only.
        Specified by:
        isReadOnly in interface QuantilesAPI
        Returns:
        true if this sketch is read only.