Class CpcSketch
- java.lang.Object
-
- org.apache.datasketches.cpc.CpcSketch
-
public final class CpcSketch extends Object
This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a FM85) algorithms developed by Kevin Lang in his paper Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.This sketch is extremely space-efficient when serialized. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, another well-known estimator, the Historical Inverse Probability (HIP) estimator does not. The update speed performance of this sketch is quite fast and is comparable to the speed of HLL. The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K.
For additional security this sketch can be configured with a user-specified hash seed.
- Author:
- Lee Rhodes, Kevin Lang
-
-
Field Summary
Fields Modifier and Type Field Description static int
DEFAULT_LG_K
The default Log_base2 of K
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description double
getEstimate()
Returns the best estimate of the cardinality of the sketch.static Family
getFamily()
Return the DataSketches identifier for this CPC family of sketches.int
getLgK()
Return the parameter LgK.double
getLowerBound(int kappa)
Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.static int
getMaxSerializedBytes(int lgK)
The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.double
getUpperBound(int kappa)
Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.static CpcSketch
heapify(byte[] byteArray)
Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.static CpcSketch
heapify(byte[] byteArray, long seed)
Return the given byte array as a CpcSketch on the Java heap.static CpcSketch
heapify(org.apache.datasketches.memory.Memory mem)
Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.static CpcSketch
heapify(org.apache.datasketches.memory.Memory mem, long seed)
Return the given Memory as a CpcSketch on the Java heap.boolean
isEmpty()
Return true if this sketch is emptyvoid
reset()
Resets this sketch to empty but retains the original LgK and Seed.byte[]
toByteArray()
Return this sketch as a compressed byte array.String
toString()
Return a human-readable string summary of this sketchString
toString(boolean detail)
Return a human-readable string summary of this sketchstatic String
toString(byte[] byteArr, boolean detail)
Returns a human readable string of the preamble of a byte array image of a CpcSketch.static String
toString(org.apache.datasketches.memory.Memory mem, boolean detail)
Returns a human readable string of the preamble of a Memory image of a CpcSketch.void
update(byte[] data)
Present the given byte array as a potential unique item.void
update(char[] data)
Present the given char array as a potential unique item.void
update(double datum)
Present the given double (or float) datum as a potential unique item.void
update(int[] data)
Present the given integer array as a potential unique item.void
update(long datum)
Present the given long as a potential unique item.void
update(long[] data)
Present the given long array as a potential unique item.void
update(String datum)
Present the given String as a potential unique item.boolean
validate()
Convience function that this Sketch is valid.
-
-
-
Field Detail
-
DEFAULT_LG_K
public static final int DEFAULT_LG_K
The default Log_base2 of K- See Also:
- Constant Field Values
-
-
Constructor Detail
-
CpcSketch
public CpcSketch()
Constructor with default log_base2 of k
-
CpcSketch
public CpcSketch(int lgK)
Constructor with log_base2 of k.- Parameters:
lgK
- the given log_base2 of k
-
CpcSketch
public CpcSketch(int lgK, long seed)
Constructor with log_base2 of k and seed.- Parameters:
lgK
- the given log_base2 of kseed
- the given seed
-
-
Method Detail
-
getEstimate
public double getEstimate()
Returns the best estimate of the cardinality of the sketch.- Returns:
- the best estimate of the cardinality of the sketch.
-
getFamily
public static Family getFamily()
Return the DataSketches identifier for this CPC family of sketches.- Returns:
- the DataSketches identifier for this CPC family of sketches.
-
getLgK
public int getLgK()
Return the parameter LgK.- Returns:
- the parameter LgK.
-
getLowerBound
public double getLowerBound(int kappa)
Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.- Parameters:
kappa
- the given number of standard deviations from the mean: 1, 2 or 3.- Returns:
- the best estimate of the lower bound of the confidence interval given kappa.
-
getMaxSerializedBytes
public static int getMaxSerializedBytes(int lgK)
The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.For small values of n the size can be much smaller.
- Parameters:
lgK
- the given value of lgK.- Returns:
- the estimated maximum compressed serialized size of a sketch.
-
getUpperBound
public double getUpperBound(int kappa)
Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.- Parameters:
kappa
- the given number of standard deviations from the mean: 1, 2 or 3.- Returns:
- the best estimate of the upper bound of the confidence interval given kappa.
-
heapify
public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem)
Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.- Parameters:
mem
- the given Memory- Returns:
- the given Memory as a CpcSketch on the Java heap.
-
heapify
public static CpcSketch heapify(byte[] byteArray)
Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.- Parameters:
byteArray
- the given byte array- Returns:
- the given byte array as a CpcSketch on the Java heap.
-
heapify
public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem, long seed)
Return the given Memory as a CpcSketch on the Java heap.- Parameters:
mem
- the given Memoryseed
- the seed used to create the original sketch from which the Memory was derived.- Returns:
- the given Memory as a CpcSketch on the Java heap.
-
heapify
public static CpcSketch heapify(byte[] byteArray, long seed)
Return the given byte array as a CpcSketch on the Java heap.- Parameters:
byteArray
- the given byte arrayseed
- the seed used to create the original sketch from which the byte array was derived.- Returns:
- the given byte array as a CpcSketch on the Java heap.
-
isEmpty
public boolean isEmpty()
Return true if this sketch is empty- Returns:
- true if this sketch is empty
-
reset
public final void reset()
Resets this sketch to empty but retains the original LgK and Seed.
-
toByteArray
public byte[] toByteArray()
Return this sketch as a compressed byte array.- Returns:
- this sketch as a compressed byte array.
-
update
public void update(long datum)
Present the given long as a potential unique item.- Parameters:
datum
- The given long datum.
-
update
public void update(double datum)
Present the given double (or float) datum as a potential unique item. The double will be converted to a long using Double.doubleToLongBits(datum), which normalizes all NaN values to a single NaN representation. Plus and minus zero will be normalized to plus zero. The special floating-point values NaN and +/- Infinity are treated as distinct.- Parameters:
datum
- The given double datum.
-
update
public void update(String datum)
Present the given String as a potential unique item. The string is converted to a byte array using UTF8 encoding. If the string is null or empty no update attempt is made and the method returns.Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.
- Parameters:
datum
- The given String.
-
update
public void update(byte[] data)
Present the given byte array as a potential unique item. If the byte array is null or empty no update attempt is made and the method returns.- Parameters:
data
- The given byte array.
-
update
public void update(char[] data)
Present the given char array as a potential unique item. If the char array is null or empty no update attempt is made and the method returns.Note: this will not produce the same output hash values as the
update(String)
method but will be a little faster as it avoids the complexity of the UTF8 encoding.- Parameters:
data
- The given char array.
-
update
public void update(int[] data)
Present the given integer array as a potential unique item. If the integer array is null or empty no update attempt is made and the method returns.- Parameters:
data
- The given int array.
-
update
public void update(long[] data)
Present the given long array as a potential unique item. If the long array is null or empty no update attempt is made and the method returns.- Parameters:
data
- The given long array.
-
validate
public boolean validate()
Convience function that this Sketch is valid. This is a troubleshooting tool for sketches that have been heapified from serialized images.If you are starting with a serialized image as a byte array, first heapify the byte array to a sketch, which performs a number of checks. Then use this function as one additional check on the sketch.
- Returns:
- true if this sketch is validated.
-
toString
public String toString()
Return a human-readable string summary of this sketch
-
toString
public String toString(boolean detail)
Return a human-readable string summary of this sketch- Parameters:
detail
- include data detail- Returns:
- a human-readable string summary of this sketch
-
toString
public static String toString(byte[] byteArr, boolean detail)
Returns a human readable string of the preamble of a byte array image of a CpcSketch.- Parameters:
byteArr
- the given byte arraydetail
- if true, a dump of the compressed window and surprising value streams will be included.- Returns:
- a human readable string of the preamble of a byte array image of a CpcSketch.
-
toString
public static String toString(org.apache.datasketches.memory.Memory mem, boolean detail)
Returns a human readable string of the preamble of a Memory image of a CpcSketch.- Parameters:
mem
- the given Memorydetail
- if true, a dump of the compressed window and surprising value streams will be included.- Returns:
- a human readable string of the preamble of a Memory image of a CpcSketch.
-
-