Class KllItemsSketch<T>
- java.lang.Object
-
- org.apache.datasketches.kll.KllSketch
-
- org.apache.datasketches.kll.KllItemsSketch<T>
-
- Type Parameters:
T
- The sketch data type.
- All Implemented Interfaces:
PartitioningFeature<T>
,QuantilesAPI
,QuantilesGenericAPI<T>
,SketchPartitionLimits
public abstract class KllItemsSketch<T> extends KllSketch implements QuantilesGenericAPI<T>
This variation of the KllSketch implements generic data types. The user must provide a suitable implementation of the java.lang.Comparator as well as an implementation of the serializer / deserializer, org.apache.datasketches.common.ArrayOfItemsSerDe.- See Also:
KllSketch
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.datasketches.kll.KllSketch
KllSketch.SketchStructure, KllSketch.SketchType
-
-
Field Summary
-
Fields inherited from interface org.apache.datasketches.quantilescommon.QuantilesAPI
EMPTY_MSG, MEM_REQ_SVR_NULL_MSG, NOT_SINGLE_ITEM_MSG, SELF_MERGE_MSG, TGT_IS_READ_ONLY_MSG, UNSUPPORTED_MSG
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description double[]
getCDF(T[] splitPoints, QuantileSearchCriteria searchCrit)
Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.Class<T>
getClassOfT()
Comparator<? super T>
getComparator()
Returns the Comparator of TGenericPartitionBoundaries<T>
getPartitionBoundariesFromNumParts(int numEquallySizedParts, QuantileSearchCriteria searchCrit)
This method returns an instance ofGenericPartitionBoundaries
which provides sufficient information for the user to create the given number of equally sized partitions, where "equally sized" refers to an approximately equal number of items per partition.GenericPartitionBoundaries<T>
getPartitionBoundariesFromPartSize(long nominalPartSizeItems, QuantileSearchCriteria searchCrit)
This method returns an instance ofGenericPartitionBoundaries
which provides sufficient information for the user to create the given number of equally sized partitions, where "equally sized" refers to an approximately equal number of items per partition.double[]
getPMF(T[] splitPoints, QuantileSearchCriteria searchCrit)
Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.T
getQuantile(double rank, QuantileSearchCriteria searchCrit)
Gets the approximate quantile of the given normalized rank and the given search criterion.T
getQuantileLowerBound(double rank)
Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.T[]
getQuantiles(double[] ranks, QuantileSearchCriteria searchCrit)
Gets an array of quantiles from the given array of normalized ranks.T
getQuantileUpperBound(double rank)
Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.double
getRank(T quantile, QuantileSearchCriteria searchCrit)
Gets the normalized rank corresponding to the given a quantile.double
getRankLowerBound(double rank)
Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists.double[]
getRanks(T[] quantiles, QuantileSearchCriteria searchCrit)
Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.double
getRankUpperBound(double rank)
Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists.ItemsSketchSortedView<T>
getSortedView()
Gets the sorted view of this sketchstatic <T> KllItemsSketch<T>
heapify(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Factory heapify takes a compact sketch image in Memory and instantiates an on-heap sketch.QuantilesGenericSketchIterator<T>
iterator()
Gets the iterator for this sketch, which is not sorted.void
merge(KllSketch other)
Merges another sketch into this one.static <T> KllItemsSketch<T>
newHeapInstance(int k, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Create a new heap instance of this sketch with a given parameter k.static <T> KllItemsSketch<T>
newHeapInstance(Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Create a new heap instance of this sketch with the default k = 200.void
reset()
Resets this sketch to the empty state.byte[]
toByteArray()
Export the current sketch as a compact byte array.String
toString(boolean withLevels, boolean withLevelsAndItems)
Returns human readable summary information about this sketch.void
update(T item)
Updates this sketch with the given item.void
update(T item, long weight)
Weighted update.static <T> KllItemsSketch<T>
wrap(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Constructs a thin wrapper on the heap around a Memory (or WritableMemory) already initialized with a validated sketch image of a type T consistent with the given comparator and serDe.-
Methods inherited from class org.apache.datasketches.kll.KllSketch
getKFromEpsilon, getMaxSerializedSizeBytes, getNormalizedRankError, getNormalizedRankError, getNumRetained, getSerializedSizeBytes, hasMemory, isCompactMemoryFormat, isDirect, isEmpty, isEstimationMode, isMemoryUpdatableFormat, isReadOnly, isSameResource, toString
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.datasketches.quantilescommon.PartitioningFeature
getPartitionBoundariesFromNumParts, getPartitionBoundariesFromPartSize
-
Methods inherited from interface org.apache.datasketches.quantilescommon.QuantilesAPI
getK, getN, getNormalizedRankError, getNumRetained, hasMemory, isDirect, isEmpty, isEstimationMode, isReadOnly, toString
-
Methods inherited from interface org.apache.datasketches.quantilescommon.QuantilesGenericAPI
getCDF, getMaxItem, getMaxPartitions, getMinItem, getPMF, getQuantile, getQuantiles, getRank, getRanks
-
Methods inherited from interface org.apache.datasketches.quantilescommon.SketchPartitionLimits
getMinPartitionSizeItems, getN
-
-
-
-
Method Detail
-
newHeapInstance
public static <T> KllItemsSketch<T> newHeapInstance(Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Create a new heap instance of this sketch with the default k = 200. The default k = 200 results in a normalized rank error of about 1.65%. Larger K will have smaller error but the sketch will be larger (and slower).- Type Parameters:
T
- The sketch data type.- Parameters:
comparator
- to compare itemsserDe
- Serializer / deserializer for an array of items, T[].- Returns:
- new KllItemsSketch on the Java heap.
-
newHeapInstance
public static <T> KllItemsSketch<T> newHeapInstance(int k, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Create a new heap instance of this sketch with a given parameter k. k can be between DEFAULT_M and 65535, inclusive. The default k = 200 results in a normalized rank error of about 1.65%. Larger K will have smaller error but the sketch will be larger (and slower).- Type Parameters:
T
- The sketch data type- Parameters:
k
- parameter that controls size of the sketch and accuracy of estimates.comparator
- to compare itemsserDe
- Serializer / deserializer for items of type T and T[].- Returns:
- new KllItemsSketch on the heap.
-
heapify
public static <T> KllItemsSketch<T> heapify(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Factory heapify takes a compact sketch image in Memory and instantiates an on-heap sketch. The resulting sketch will not retain any link to the source Memory.- Type Parameters:
T
- The sketch data type- Parameters:
srcMem
- a compact Memory image of a sketch serialized by this sketch and of the same type of T.comparator
- to compare itemsserDe
- Serializer / deserializer for items of type T and T[].- Returns:
- a heap-based sketch based on the given Memory.
-
wrap
public static <T> KllItemsSketch<T> wrap(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
Constructs a thin wrapper on the heap around a Memory (or WritableMemory) already initialized with a validated sketch image of a type T consistent with the given comparator and serDe. A reference to the Memory is kept in the sketch and must remain in scope consistent with the temporal scope of this sketch. The amount of data kept on the heap is very small. All of the item data originally collected by the given Memory sketch object remains in the Memory object- Type Parameters:
T
- The sketch data type- Parameters:
srcMem
- the Memory object that this sketch will wrap.comparator
- to compare itemsserDe
- Serializer / deserializer for items of type T and T[].- Returns:
- a heap-base sketch that is a thin wrapper around the given srcMem.
-
getCDF
public double[] getCDF(T[] splitPoints, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0], given a set of splitPoints.The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
- Specified by:
getCDF
in interfaceQuantilesGenericAPI<T>
- Parameters:
splitPoints
- an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 overlapping intervals.The start of each interval is below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and the end of the interval is the rank or cumulative probability corresponding to the split point.
The (m+1)th interval represents 100% of the distribution represented by the sketch and consistent with the definition of a cumulative probability distribution, thus the (m+1)th rank or probability in the returned array is always 1.0.
If a split point exactly equals a retained item of the sketch and the search criterion is:
- INCLUSIVE, the resulting cumulative probability will include that item.
- EXCLUSIVE, the resulting cumulative probability will not include the weight of that split point.
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.- Returns:
- a discrete CDF array of m+1 double ranks (or cumulative probabilities) on the interval [0.0, 1.0].
-
getClassOfT
public Class<T> getClassOfT()
- Specified by:
getClassOfT
in interfaceQuantilesGenericAPI<T>
- Returns:
- the sketch item class
-
getComparator
public Comparator<? super T> getComparator()
Description copied from interface:QuantilesGenericAPI
Returns the Comparator of T- Specified by:
getComparator
in interfaceQuantilesGenericAPI<T>
- Returns:
- Comparator of the sketch
-
getPartitionBoundariesFromNumParts
public GenericPartitionBoundaries<T> getPartitionBoundariesFromNumParts(int numEquallySizedParts, QuantileSearchCriteria searchCrit)
Description copied from interface:PartitioningFeature
This method returns an instance ofGenericPartitionBoundaries
which provides sufficient information for the user to create the given number of equally sized partitions, where "equally sized" refers to an approximately equal number of items per partition.The sketch must not be empty.
- Specified by:
getPartitionBoundariesFromNumParts
in interfacePartitioningFeature<T>
- Parameters:
numEquallySizedParts
- an integer that specifies the number of equally sized partitions betweengetMinItem()
andgetMaxItem()
. This must be a positive integer less thangetMaxPartitions()
- A 1 will return: minItem, maxItem.
- A 2 will return: minItem, median quantile, maxItem.
- Etc.
searchCrit
- If INCLUSIVE, all the returned quantiles are the upper boundaries of the equally sized partitions with the exception of the lowest returned quantile, which is the lowest boundary of the lowest ranked partition. If EXCLUSIVE, all the returned quantiles are the lower boundaries of the equally sized partitions with the exception of the highest returned quantile, which is the upper boundary of the highest ranked partition.- Returns:
- an instance of
GenericPartitionBoundaries
.
-
getPartitionBoundariesFromPartSize
public GenericPartitionBoundaries<T> getPartitionBoundariesFromPartSize(long nominalPartSizeItems, QuantileSearchCriteria searchCrit)
Description copied from interface:PartitioningFeature
This method returns an instance ofGenericPartitionBoundaries
which provides sufficient information for the user to create the given number of equally sized partitions, where "equally sized" refers to an approximately equal number of items per partition.The sketch must not be empty.
- Specified by:
getPartitionBoundariesFromPartSize
in interfacePartitioningFeature<T>
- Parameters:
nominalPartSizeItems
- an integer that specifies the nominal size, in items, of each target partition. This must be a positive integer greater thangetMinPartitionSizeItems()
.searchCrit
- If INCLUSIVE, all the returned quantiles are the upper boundaries of the equally sized partitions with the exception of the lowest returned quantile, which is the lowest boundary of the lowest ranked partition. If EXCLUSIVE, all the returned quantiles are the lower boundaries of the equally sized partitions with the exception of the highest returned quantile, which is the upper boundary of the highest ranked partition.- Returns:
- an instance of
GenericPartitionBoundaries
.
-
getPMF
public double[] getPMF(T[] splitPoints, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses as doubles on the interval [0.0, 1.0], given a set of splitPoints.The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.
- Specified by:
getPMF
in interfaceQuantilesGenericAPI<T>
- Parameters:
splitPoints
- an array of m unique, monotonically increasing items (of the same type as the input items) that divide the item input domain into m+1 consecutive, non-overlapping intervals.Each interval except for the end intervals starts with a split point and ends with the next split point in sequence.
The first interval starts below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and ends with the first split point
The last (m+1)th interval starts with the last split point and ends after the last item retained by the sketch corresponding to a rank or probability of 1.0.
The sum of the probability masses of all (m+1) intervals is 1.0.
If the search criterion is:
- INCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will include that item. If the lower split point equals an item retained by the sketch, the interval will exclude that item.
- EXCLUSIVE, and the upper split point of an interval equals an item retained by the sketch, the interval will exclude that item. If the lower split point equals an item retained by the sketch, the interval will include that item.
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.- Returns:
- a PMF array of m+1 probability masses as doubles on the interval [0.0, 1.0].
-
getQuantile
public T getQuantile(double rank, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Gets the approximate quantile of the given normalized rank and the given search criterion.- Specified by:
getQuantile
in interfaceQuantilesGenericAPI<T>
- Parameters:
rank
- the given normalized rank, a double in the range [0.0, 1.0].searchCrit
- If INCLUSIVE, the given rank includes all quantiles ≤ the quantile directly corresponding to the given rank. If EXCLUSIVE, he given rank includes all quantiles < the quantile directly corresponding to the given rank.- Returns:
- the approximate quantile given the normalized rank.
- See Also:
QuantileSearchCriteria
-
getQuantiles
public T[] getQuantiles(double[] ranks, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Gets an array of quantiles from the given array of normalized ranks.- Specified by:
getQuantiles
in interfaceQuantilesGenericAPI<T>
- Parameters:
ranks
- the given array of normalized ranks, each of which must be in the interval [0.0,1.0].searchCrit
- if INCLUSIVE, the given ranks include all quantiles ≤ the quantile directly corresponding to each rank.- Returns:
- an array of quantiles corresponding to the given array of normalized ranks.
- See Also:
QuantileSearchCriteria
-
getQuantileLowerBound
public T getQuantileLowerBound(double rank)
Description copied from interface:QuantilesGenericAPI
Gets the lower bound of the quantile confidence interval in which the quantile of the given rank exists.Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile confidence interval as an additive or multiplicative percent of the true quantile.
- Specified by:
getQuantileLowerBound
in interfaceQuantilesGenericAPI<T>
- Parameters:
rank
- the given normalized rank- Returns:
- the lower bound of the quantile confidence interval in which the quantile of the given rank exists.
-
getQuantileUpperBound
public T getQuantileUpperBound(double rank)
Description copied from interface:QuantilesGenericAPI
Gets the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile interval as an additive or multiplicative percent of the true quantile.
- Specified by:
getQuantileUpperBound
in interfaceQuantilesGenericAPI<T>
- Parameters:
rank
- the given normalized rank- Returns:
- the upper bound of the quantile confidence interval in which the true quantile of the given rank exists.
-
getRank
public double getRank(T quantile, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Gets the normalized rank corresponding to the given a quantile.- Specified by:
getRank
in interfaceQuantilesGenericAPI<T>
- Parameters:
quantile
- the given quantilesearchCrit
- if INCLUSIVE the given quantile is included into the rank.- Returns:
- the normalized rank corresponding to the given quantile.
- See Also:
QuantileSearchCriteria
-
getRankLowerBound
public double getRankLowerBound(double rank)
Gets the lower bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.99.- Specified by:
getRankLowerBound
in interfaceQuantilesAPI
- Parameters:
rank
- the given normalized rank.- Returns:
- the lower bound of the rank confidence interval in which the true rank of the given rank exists.
-
getRankUpperBound
public double getRankUpperBound(double rank)
Gets the upper bound of the rank confidence interval in which the true rank of the given rank exists. The approximate probability that the true rank is within the confidence interval specified by the upper and lower rank bounds for this sketch is 0.99.- Specified by:
getRankUpperBound
in interfaceQuantilesAPI
- Parameters:
rank
- the given normalized rank.- Returns:
- the upper bound of the rank confidence interval in which the true rank of the given rank exists.
-
getRanks
public double[] getRanks(T[] quantiles, QuantileSearchCriteria searchCrit)
Description copied from interface:QuantilesGenericAPI
Gets an array of normalized ranks corresponding to the given array of quantiles and the given search criterion.- Specified by:
getRanks
in interfaceQuantilesGenericAPI<T>
- Parameters:
quantiles
- the given array of quantilessearchCrit
- if INCLUSIVE, the given quantiles include the rank directly corresponding to each quantile.- Returns:
- an array of normalized ranks corresponding to the given array of quantiles.
- See Also:
QuantileSearchCriteria
-
getSortedView
public final ItemsSketchSortedView<T> getSortedView()
Description copied from interface:QuantilesGenericAPI
Gets the sorted view of this sketch- Specified by:
getSortedView
in interfaceQuantilesGenericAPI<T>
- Returns:
- the sorted view of this sketch
-
iterator
public QuantilesGenericSketchIterator<T> iterator()
Description copied from interface:QuantilesGenericAPI
Gets the iterator for this sketch, which is not sorted.- Specified by:
iterator
in interfaceQuantilesGenericAPI<T>
- Returns:
- the iterator for this sketch
-
merge
public final void merge(KllSketch other)
Description copied from class:KllSketch
Merges another sketch into this one. Attempting to merge a sketch of the wrong type will throw an exception.
-
reset
public void reset()
Description copied from interface:QuantilesAPI
Resets this sketch to the empty state. If the sketch is read only this does nothing.The parameter k will not change.
- Specified by:
reset
in interfaceQuantilesAPI
-
toByteArray
public byte[] toByteArray()
Export the current sketch as a compact byte array.- Returns:
- the current sketch as a compact byte array.
-
toString
public String toString(boolean withLevels, boolean withLevelsAndItems)
Description copied from class:KllSketch
Returns human readable summary information about this sketch. Used for debugging.
-
update
public void update(T item)
Description copied from interface:QuantilesGenericAPI
Updates this sketch with the given item.- Specified by:
update
in interfaceQuantilesGenericAPI<T>
- Parameters:
item
- from a stream of items. Nulls are ignored.
-
update
public void update(T item, long weight)
Weighted update. Updates this sketch with the given item the number of times specified by the given integer weight.- Parameters:
item
- the item to be repeated. NaNs are ignored.weight
- the number of times the update of item is to be repeated. It must be ≥ one.
-
-