Class CpcSketch


  • public final class CpcSketch
    extends Object
    This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a FM85) algorithms developed by Kevin Lang in his paper Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.

    This sketch is extremely space-efficient when serialized. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, another well-known estimator, the Historical Inverse Probability (HIP) estimator does not. The update speed performance of this sketch is quite fast and is comparable to the speed of HLL. The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K.

    For additional security this sketch can be configured with a user-specified hash seed.

    Author:
    Lee Rhodes, Kevin Lang
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int DEFAULT_LG_K
      The default Log_base2 of K
    • Constructor Summary

      Constructors 
      Constructor Description
      CpcSketch()
      Constructor with default log_base2 of k
      CpcSketch​(int lgK)
      Constructor with log_base2 of k.
      CpcSketch​(int lgK, long seed)
      Constructor with log_base2 of k and seed.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      double getEstimate()
      Returns the best estimate of the cardinality of the sketch.
      static Family getFamily()
      Return the DataSketches identifier for this CPC family of sketches.
      int getLgK()
      Return the parameter LgK.
      double getLowerBound​(int kappa)
      Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.
      static int getMaxSerializedBytes​(int lgK)
      The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.
      double getUpperBound​(int kappa)
      Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.
      static CpcSketch heapify​(byte[] byteArray)
      Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
      static CpcSketch heapify​(byte[] byteArray, long seed)
      Return the given byte array as a CpcSketch on the Java heap.
      static CpcSketch heapify​(org.apache.datasketches.memory.Memory mem)
      Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
      static CpcSketch heapify​(org.apache.datasketches.memory.Memory mem, long seed)
      Return the given Memory as a CpcSketch on the Java heap.
      boolean isEmpty()
      Return true if this sketch is empty
      void reset()
      Resets this sketch to empty but retains the original LgK and Seed.
      byte[] toByteArray()
      Return this sketch as a compressed byte array.
      String toString()
      Return a human-readable string summary of this sketch
      String toString​(boolean detail)
      Return a human-readable string summary of this sketch
      static String toString​(byte[] byteArr, boolean detail)
      Returns a human readable string of the preamble of a byte array image of a CpcSketch.
      static String toString​(org.apache.datasketches.memory.Memory mem, boolean detail)
      Returns a human readable string of the preamble of a Memory image of a CpcSketch.
      void update​(byte[] data)
      Present the given byte array as a potential unique item.
      void update​(char[] data)
      Present the given char array as a potential unique item.
      void update​(double datum)
      Present the given double (or float) datum as a potential unique item.
      void update​(int[] data)
      Present the given integer array as a potential unique item.
      void update​(long datum)
      Present the given long as a potential unique item.
      void update​(long[] data)
      Present the given long array as a potential unique item.
      void update​(String datum)
      Present the given String as a potential unique item.
      void update​(ByteBuffer data)
      Present the given ByteBuffer as a potential unique item If the ByteBuffer is null or empty no update attempt is made and the method returns
      boolean validate()
      Convience function that this Sketch is valid.
    • Field Detail

      • DEFAULT_LG_K

        public static final int DEFAULT_LG_K
        The default Log_base2 of K
        See Also:
        Constant Field Values
    • Constructor Detail

      • CpcSketch

        public CpcSketch()
        Constructor with default log_base2 of k
      • CpcSketch

        public CpcSketch​(int lgK)
        Constructor with log_base2 of k.
        Parameters:
        lgK - the given log_base2 of k
      • CpcSketch

        public CpcSketch​(int lgK,
                         long seed)
        Constructor with log_base2 of k and seed.
        Parameters:
        lgK - the given log_base2 of k
        seed - the given seed
    • Method Detail

      • getEstimate

        public double getEstimate()
        Returns the best estimate of the cardinality of the sketch.
        Returns:
        the best estimate of the cardinality of the sketch.
      • getFamily

        public static Family getFamily()
        Return the DataSketches identifier for this CPC family of sketches.
        Returns:
        the DataSketches identifier for this CPC family of sketches.
      • getLgK

        public int getLgK()
        Return the parameter LgK.
        Returns:
        the parameter LgK.
      • getLowerBound

        public double getLowerBound​(int kappa)
        Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.
        Parameters:
        kappa - the given number of standard deviations from the mean: 1, 2 or 3.
        Returns:
        the best estimate of the lower bound of the confidence interval given kappa.
      • getMaxSerializedBytes

        public static int getMaxSerializedBytes​(int lgK)
        The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.

        For small values of n the size can be much smaller.

        Parameters:
        lgK - the given value of lgK.
        Returns:
        the estimated maximum compressed serialized size of a sketch.
      • getUpperBound

        public double getUpperBound​(int kappa)
        Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.
        Parameters:
        kappa - the given number of standard deviations from the mean: 1, 2 or 3.
        Returns:
        the best estimate of the upper bound of the confidence interval given kappa.
      • heapify

        public static CpcSketch heapify​(org.apache.datasketches.memory.Memory mem)
        Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
        Parameters:
        mem - the given Memory
        Returns:
        the given Memory as a CpcSketch on the Java heap.
      • heapify

        public static CpcSketch heapify​(byte[] byteArray)
        Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
        Parameters:
        byteArray - the given byte array
        Returns:
        the given byte array as a CpcSketch on the Java heap.
      • heapify

        public static CpcSketch heapify​(org.apache.datasketches.memory.Memory mem,
                                        long seed)
        Return the given Memory as a CpcSketch on the Java heap.
        Parameters:
        mem - the given Memory
        seed - the seed used to create the original sketch from which the Memory was derived.
        Returns:
        the given Memory as a CpcSketch on the Java heap.
      • heapify

        public static CpcSketch heapify​(byte[] byteArray,
                                        long seed)
        Return the given byte array as a CpcSketch on the Java heap.
        Parameters:
        byteArray - the given byte array
        seed - the seed used to create the original sketch from which the byte array was derived.
        Returns:
        the given byte array as a CpcSketch on the Java heap.
      • isEmpty

        public boolean isEmpty()
        Return true if this sketch is empty
        Returns:
        true if this sketch is empty
      • reset

        public final void reset()
        Resets this sketch to empty but retains the original LgK and Seed.
      • toByteArray

        public byte[] toByteArray()
        Return this sketch as a compressed byte array.
        Returns:
        this sketch as a compressed byte array.
      • update

        public void update​(long datum)
        Present the given long as a potential unique item.
        Parameters:
        datum - The given long datum.
      • update

        public void update​(double datum)
        Present the given double (or float) datum as a potential unique item. The double will be converted to a long using Double.doubleToLongBits(datum), which normalizes all NaN values to a single NaN representation. Plus and minus zero will be normalized to plus zero. The special floating-point values NaN and +/- Infinity are treated as distinct.
        Parameters:
        datum - The given double datum.
      • update

        public void update​(String datum)
        Present the given String as a potential unique item. The string is converted to a byte array using UTF8 encoding. If the string is null or empty no update attempt is made and the method returns.

        Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.

        Parameters:
        datum - The given String.
      • update

        public void update​(byte[] data)
        Present the given byte array as a potential unique item. If the byte array is null or empty no update attempt is made and the method returns.
        Parameters:
        data - The given byte array.
      • update

        public void update​(ByteBuffer data)
        Present the given ByteBuffer as a potential unique item If the ByteBuffer is null or empty no update attempt is made and the method returns
        Parameters:
        data - The given ByteBuffer
      • update

        public void update​(char[] data)
        Present the given char array as a potential unique item. If the char array is null or empty no update attempt is made and the method returns.

        Note: this will not produce the same output hash values as the update(String) method but will be a little faster as it avoids the complexity of the UTF8 encoding.

        Parameters:
        data - The given char array.
      • update

        public void update​(int[] data)
        Present the given integer array as a potential unique item. If the integer array is null or empty no update attempt is made and the method returns.
        Parameters:
        data - The given int array.
      • update

        public void update​(long[] data)
        Present the given long array as a potential unique item. If the long array is null or empty no update attempt is made and the method returns.
        Parameters:
        data - The given long array.
      • validate

        public boolean validate()
        Convience function that this Sketch is valid. This is a troubleshooting tool for sketches that have been heapified from serialized images.

        If you are starting with a serialized image as a byte array, first heapify the byte array to a sketch, which performs a number of checks. Then use this function as one additional check on the sketch.

        Returns:
        true if this sketch is validated.
      • toString

        public String toString()
        Return a human-readable string summary of this sketch
        Overrides:
        toString in class Object
      • toString

        public String toString​(boolean detail)
        Return a human-readable string summary of this sketch
        Parameters:
        detail - include data detail
        Returns:
        a human-readable string summary of this sketch
      • toString

        public static String toString​(byte[] byteArr,
                                      boolean detail)
        Returns a human readable string of the preamble of a byte array image of a CpcSketch.
        Parameters:
        byteArr - the given byte array
        detail - if true, a dump of the compressed window and surprising value streams will be included.
        Returns:
        a human readable string of the preamble of a byte array image of a CpcSketch.
      • toString

        public static String toString​(org.apache.datasketches.memory.Memory mem,
                                      boolean detail)
        Returns a human readable string of the preamble of a Memory image of a CpcSketch.
        Parameters:
        mem - the given Memory
        detail - if true, a dump of the compressed window and surprising value streams will be included.
        Returns:
        a human readable string of the preamble of a Memory image of a CpcSketch.