Class ReqSketch
- All Implemented Interfaces:
QuantilesAPI
,QuantilesFloatsAPI
Reference: https://arxiv.org/abs/2004.01668
This implementation differs from the algorithm described in the paper in the following:
- The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
- The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a number set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
- The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.
This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.
- The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.
- The Python prototype only implemented a comparison criterion of "INCLUSIVE". This implementation allows the user to switch back and forth between the "INCLUSIVE" criterion and the "EXCLUSIVE" criterion.
- This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
- Author:
- Edo Liberty, Pavel Vesely, Lee Rhodes
- See Also:
-
Field Summary
Fields inherited from interface org.apache.datasketches.quantilescommon.QuantilesAPI
EMPTY_MSG, MEM_REQ_SVR_NULL_MSG, NOT_SINGLE_ITEM_MSG, SELF_MERGE_MSG, TGT_IS_READ_ONLY_MSG, UNSUPPORTED_MSG
-
Method Summary
Modifier and TypeMethodDescriptionstatic final ReqSketchBuilder
builder()
Returns a new ReqSketchBuilderdouble[]
getCDF
(float[] splitPoints, QuantileSearchCriteria searchCrit) Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.boolean
If true, the high ranks are prioritized for better accuracy.int
getK()
Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.float
Returns the maximum item of the stream.float
Returns the minimum item of the stream.long
getN()
Gets the length of the input stream offered to the sketch..double
getNormalizedRankError
(boolean pmf) Gets the approximate rank error of this sketch normalized as a fraction between zero and one.int
Gets the number of quantiles retained by the sketch.double[]
getPMF
(float[] splitPoints, QuantileSearchCriteria searchCrit) Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.float
getQuantile
(double normRank, QuantileSearchCriteria searchCrit) Gets the approximate quantile of the given normalized rank and the given search criterion.float
getQuantileLowerBound
(double rank) Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.float
getQuantileLowerBound
(double rank, int numStdDev) Gets an approximate lower bound of the quantile associated with the given rank.float[]
getQuantiles
(double[] normRanks, QuantileSearchCriteria searchCrit) Gets an array of quantiles from the given array of normalized ranks.float
getQuantileUpperBound
(double rank) Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.float
getQuantileUpperBound
(double rank, int numStdDev) Gets an approximate upper bound of the quantile associated with the given rank.double
getRank
(float quantile, QuantileSearchCriteria searchCrit) Gets the normalized rank corresponding to the given a quantile.double
getRankLowerBound
(double rank) Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists.double
getRankLowerBound
(double rank, int numStdDev) Gets an approximate lower bound rank of the given normalized rank.double[]
getRanks
(float[] quantiles, QuantileSearchCriteria searchCrit) Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.double
getRankUpperBound
(double rank) Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists.double
getRankUpperBound
(double rank, int numStdDev) Gets an approximate upper bound rank of the given rank.static double
getRSE
(int k, double rank, boolean hra, long totalN) Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).int
Returns the current number of bytes this Sketch would require if serialized.Gets the sorted view of this sketchboolean
Returns true if this sketch's data structure is backed by Memory or WritableMemory.static ReqSketch
heapify
(org.apache.datasketches.memory.Memory mem) Returns an ReqSketch on the heap from a Memory image of the sketch.boolean
isDirect()
Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).boolean
isEmpty()
Returns true if this sketch is empty.boolean
Returns true if this sketch is in estimation mode.boolean
Returns true if this sketch is read only.iterator()
Gets the iterator for this sketch, which is not sorted.Merge other sketch into this one.void
reset()
Resets this sketch to the empty state.byte[]
Returns a byte array representation of this sketch.toString()
Returns a summary of the key parameters of the sketch.void
update
(float item) Updates this sketch with the given item.viewCompactorDetail
(String fmt, boolean allData) A detailed, human readable view of the sketch compactors and their data.Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.datasketches.quantilescommon.QuantilesFloatsAPI
getCDF, getPMF, getQuantile, getQuantiles, getRank, getRanks
-
Method Details
-
builder
Returns a new ReqSketchBuilder- Returns:
- a new ReqSketchBuilder
-
heapify
Returns an ReqSketch on the heap from a Memory image of the sketch.- Parameters:
mem
- The Memory object holding a valid image of an ReqSketch- Returns:
- an ReqSketch on the heap from a Memory image of the sketch.
-
getK
public int getK()Description copied from interface:QuantilesAPI
Gets the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.- Specified by:
getK
in interfaceQuantilesAPI
- Returns:
- the user configured parameter k, which controls the accuracy of the sketch and its memory space usage.
-
getCDF
Description copied from interface:QuantilesFloatsAPI
Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
- Specified by:
getCDF
in interfaceQuantilesFloatsAPI
- Parameters:
splitPoints
- an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 overlapping intervals.The start of each interval is below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and the end of the interval is the rank or cumulative probability corresponding to the split point.
The (m+1)th interval represents 100% of the distribution represented by the sketch and consistent with the definition of a cumulative probability distribution, thus the (m+1)th rank or probability in the returned array is always 1.0.
If a split point exactly equals a retained item of the sketch and the search criterion is:
- INCLUSIVE, the resulting cumulative probability will include that item.
- EXCLUSIVE, the resulting cumulative probability will not include the weight of that split point.
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.- Returns:
- a discrete CDF array of m+1 double ranks (or cumulative probabilities) on the interval [0.0, 1.0].
-
getHighRankAccuracyMode
public boolean getHighRankAccuracyMode()If true, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. This state is chosen during sketch construction.- Returns:
- the high ranks accuracy state.
-
getMaxItem
public float getMaxItem()Description copied from interface:QuantilesFloatsAPI
Returns the maximum item of the stream. This is provided for convenience, but may be different from the largest item retained by the sketch algorithm.- Specified by:
getMaxItem
in interfaceQuantilesFloatsAPI
- Returns:
- the maximum item of the stream
-
getMinItem
public float getMinItem()Description copied from interface:QuantilesFloatsAPI
Returns the minimum item of the stream. This is provided for convenience, but is distinct from the smallest item retained by the sketch algorithm.- Specified by:
getMinItem
in interfaceQuantilesFloatsAPI
- Returns:
- the minimum item of the stream
-
getN
public long getN()Description copied from interface:QuantilesAPI
Gets the length of the input stream offered to the sketch..- Specified by:
getN
in interfaceQuantilesAPI
- Returns:
- the length of the input stream offered to the sketch.
-
getNormalizedRankError
public double getNormalizedRankError(boolean pmf) Description copied from interface:QuantilesAPI
Gets the approximate rank error of this sketch normalized as a fraction between zero and one. The epsilon returned is a best fit to 99 percent confidence empirically measured max error in thousands of trials.- Parameters:
pmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.- Returns:
- if pmf is true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
-
getPMF
Description copied from interface:QuantilesFloatsAPI
Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.
- Specified by:
getPMF
in interfaceQuantilesFloatsAPI
- Parameters:
splitPoints
- an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 consecutive, non-overlapping intervals.Each interval except for the end intervals starts with a split point and ends with the next split point in sequence.
The first interval starts below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and ends with the first split point
The last (m+1)th interval starts with the last split point and ends after the last item retained by the sketch corresponding to a rank or probability of 1.0.
The sum of the probability masses of all (m+1) intervals is 1.0.
If the search criterion is:
- INCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will include that item. If the lower split point equals an item retained by the sketch, the interval will exclude that item.
- EXCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will exclude that item. If the lower split point equals an item retained by the sketch, the interval will include that item.
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.- Returns:
- a PMF array of m+1 probability masses as doubles on the interval [0.0, 1.0].
-
getQuantile
Description copied from interface:QuantilesFloatsAPI
Gets the approximate quantile of the given normalized rank and the given search criterion.- Specified by:
getQuantile
in interfaceQuantilesFloatsAPI
- Parameters:
normRank
- the given normalized rank, a double in the range [0.0, 1.0].searchCrit
- If INCLUSIVE, the given rank includes all quantiles ≤ the quantile directly corresponding to the given rank. If EXCLUSIVE, he given rank includes all quantiles < the quantile directly corresponding to the given rank.- Returns:
- the approximate quantile given the normalized rank.
- See Also:
-
getQuantiles
Description copied from interface:QuantilesFloatsAPI
Gets an array of quantiles from the given array of normalized ranks.- Specified by:
getQuantiles
in interfaceQuantilesFloatsAPI
- Parameters:
normRanks
- the given array of normalized ranks, each of which must be in the interval [0.0,1.0].searchCrit
- if INCLUSIVE, the given ranks include all quantiles ≤ the quantile directly corresponding to each rank.- Returns:
- an array of quantiles corresponding to the given array of normalized ranks.
- See Also:
-
getQuantileLowerBound
public float getQuantileLowerBound(double rank) Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile confidence interval as an additive or multiplicative percent of the true quantile.
The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.- Specified by:
getQuantileLowerBound
in interfaceQuantilesFloatsAPI
- Parameters:
rank
- the given normalized rank- Returns:
- the lower bound of the quantile confidence interval in which the quantile of the given rank exists.
-
getQuantileLowerBound
public float getQuantileLowerBound(double rank, int numStdDev) Gets an approximate lower bound of the quantile associated with the given rank.- Parameters:
rank
- the given normalized rank, a number between 0 and 1.0.numStdDev
- the number of standard deviations. Must be 1, 2, or 3.- Returns:
- an approximate lower bound quantile, if it exists.
-
getQuantileUpperBound
public float getQuantileUpperBound(double rank) Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile interval as an additive or multiplicative percent of the true quantile.
The approximate probability that the true quantile is within the confidence interval specified by the upper and lower quantile bounds for this sketch is 0.95.- Specified by:
getQuantileUpperBound
in interfaceQuantilesFloatsAPI
- Parameters:
rank
- the given normalized rank- Returns:
- the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.
-
getQuantileUpperBound
public float getQuantileUpperBound(double rank, int numStdDev) Gets an approximate upper bound of the quantile associated with the given rank.- Parameters:
rank
- the given normalized rank, a number between 0 and 1.0.numStdDev
- the number of standard deviations. Must be 1, 2, or 3.- Returns:
- an approximate upper bound quantile, if it exists.
-
getRank
Description copied from interface:QuantilesFloatsAPI
Gets the normalized rank corresponding to the given a quantile.- Specified by:
getRank
in interfaceQuantilesFloatsAPI
- Parameters:
quantile
- the given quantilesearchCrit
- if INCLUSIVE the given quantile is included into the rank.- Returns:
- the normalized rank corresponding to the given quantile.
- See Also:
-
getRankLowerBound
public double getRankLowerBound(double rank) Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.- Parameters:
rank
- the given normalized rank.- Returns:
- the lower bound of the rank confidence interval in which the true rank of the given rank exists.
-
getRankLowerBound
public double getRankLowerBound(double rank, int numStdDev) Gets an approximate lower bound rank of the given normalized rank.- Parameters:
rank
- the given normalized rank, a number between 0 and 1.0.numStdDev
- the number of standard deviations. Must be 1, 2, or 3.- Returns:
- an approximate lower bound rank.
-
getRanks
Description copied from interface:QuantilesFloatsAPI
Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.- Specified by:
getRanks
in interfaceQuantilesFloatsAPI
- Parameters:
quantiles
- the given array of quantilessearchCrit
- if INCLUSIVE, the given quantiles include the rank directly corresponding to each quantile.- Returns:
- an array of normalized ranks corresponding to the given array of quantiles.
- See Also:
-
getRankUpperBound
public double getRankUpperBound(double rank) Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.95.- Parameters:
rank
- the given normalized rank.- Returns:
- the upper bound of the rank confidence interval in which the true rank of the given rank exists.
-
getRankUpperBound
public double getRankUpperBound(double rank, int numStdDev) Gets an approximate upper bound rank of the given rank.- Parameters:
rank
- the given rank, a number between 0 and 1.0.numStdDev
- the number of standard deviations. Must be 1, 2, or 3.- Returns:
- an approximate upper bound rank.
-
getNumRetained
public int getNumRetained()Description copied from interface:QuantilesAPI
Gets the number of quantiles retained by the sketch.- Specified by:
getNumRetained
in interfaceQuantilesAPI
- Returns:
- the number of quantiles retained by the sketch
-
getSerializedSizeBytes
public int getSerializedSizeBytes()Description copied from interface:QuantilesFloatsAPI
Returns the current number of bytes this Sketch would require if serialized.- Specified by:
getSerializedSizeBytes
in interfaceQuantilesFloatsAPI
- Returns:
- the number of bytes this sketch would require if serialized.
-
isEmpty
public boolean isEmpty()Description copied from interface:QuantilesAPI
Returns true if this sketch is empty.- Specified by:
isEmpty
in interfaceQuantilesAPI
- Returns:
- true if this sketch is empty.
-
isEstimationMode
public boolean isEstimationMode()Description copied from interface:QuantilesAPI
Returns true if this sketch is in estimation mode.- Specified by:
isEstimationMode
in interfaceQuantilesAPI
- Returns:
- true if this sketch is in estimation mode.
-
iterator
Description copied from interface:QuantilesFloatsAPI
Gets the iterator for this sketch, which is not sorted.- Specified by:
iterator
in interfaceQuantilesFloatsAPI
- Returns:
- the iterator for this sketch
-
merge
Merge other sketch into this one. The other sketch is not modified.- Parameters:
other
- sketch to be merged into this one.- Returns:
- this
-
reset
public void reset()Resets this sketch to the empty state. If the sketch is read only this does nothing.The parameter k will not change.
The parameters k, highRankAccuracy, and reqDebug will not change.
- Specified by:
reset
in interfaceQuantilesAPI
-
toByteArray
public byte[] toByteArray()Description copied from interface:QuantilesFloatsAPI
Returns a byte array representation of this sketch.- Specified by:
toByteArray
in interfaceQuantilesFloatsAPI
- Returns:
- a byte array representation of this sketch.
-
toString
Description copied from interface:QuantilesAPI
Returns a summary of the key parameters of the sketch.- Specified by:
toString
in interfaceQuantilesAPI
- Returns:
- a summary of the key parameters of the sketch.
-
update
public void update(float item) Description copied from interface:QuantilesFloatsAPI
Updates this sketch with the given item.- Specified by:
update
in interfaceQuantilesFloatsAPI
- Parameters:
item
- from a stream of quantiles. NaNs are ignored.
-
viewCompactorDetail
A detailed, human readable view of the sketch compactors and their data. Each compactor string is prepended by the compactor lgWeight, the current number of retained quantiles of the compactor and the current nominal capacity of the compactor.- Parameters:
fmt
- the format string for the quantiles; example: "%4.0f".allData
- all the retained quantiles for the sketch will be output by compactor level. Otherwise, just a summary will be output.- Returns:
- a detailed view of the compactors and their data
-
getSortedView
Description copied from interface:QuantilesFloatsAPI
Gets the sorted view of this sketch- Specified by:
getSortedView
in interfaceQuantilesFloatsAPI
- Returns:
- the sorted view of this sketch
-
getRSE
public static double getRSE(int k, double rank, boolean hra, long totalN) Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were adjusted based on empirical measurements.- Parameters:
k
- the given size of krank
- the given normalized rank, a number in [0,1].hra
- if true High Rank Accuracy mode is being selected, otherwise, Low Rank Accuracy.totalN
- an estimate of the total number of items submitted to the sketch.- Returns:
- an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
-
hasMemory
public boolean hasMemory()Description copied from interface:QuantilesAPI
Returns true if this sketch's data structure is backed by Memory or WritableMemory.- Specified by:
hasMemory
in interfaceQuantilesAPI
- Returns:
- true if this sketch's data structure is backed by Memory or WritableMemory.
-
isDirect
public boolean isDirect()Description copied from interface:QuantilesAPI
Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).- Specified by:
isDirect
in interfaceQuantilesAPI
- Returns:
- true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
-
isReadOnly
public boolean isReadOnly()Description copied from interface:QuantilesAPI
Returns true if this sketch is read only.- Specified by:
isReadOnly
in interfaceQuantilesAPI
- Returns:
- true if this sketch is read only.
-