Class CpcSketch

java.lang.Object
org.apache.datasketches.cpc.CpcSketch

public final class CpcSketch extends Object
This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a FM85) algorithms developed by Kevin Lang in his paper Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.

This sketch is extremely space-efficient when serialized. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, another well-known estimator, the Historical Inverse Probability (HIP) estimator does not. The update speed performance of this sketch is quite fast and is comparable to the speed of HLL. The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K.

For additional security this sketch can be configured with a user-specified hash seed.

Author:
Lee Rhodes, Kevin Lang
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    The default Log_base2 of K
  • Constructor Summary

    Constructors
    Constructor
    Description
    Constructor with default log_base2 of k
    CpcSketch(int lgK)
    Constructor with log_base2 of k.
    CpcSketch(int lgK, long seed)
    Constructor with log_base2 of k and seed.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    Returns the best estimate of the cardinality of the sketch.
    static Family
    Return the DataSketches identifier for this CPC family of sketches.
    int
    Return the parameter LgK.
    double
    getLowerBound(int kappa)
    Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.
    static int
    The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.
    double
    getUpperBound(int kappa)
    Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.
    static CpcSketch
    heapify(byte[] byteArray)
    Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
    static CpcSketch
    heapify(byte[] byteArray, long seed)
    Return the given byte array as a CpcSketch on the Java heap.
    static CpcSketch
    heapify(org.apache.datasketches.memory.Memory mem)
    Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
    static CpcSketch
    heapify(org.apache.datasketches.memory.Memory mem, long seed)
    Return the given Memory as a CpcSketch on the Java heap.
    boolean
    Return true if this sketch is empty
    final void
    Resets this sketch to empty but retains the original LgK and Seed.
    byte[]
    Return this sketch as a compressed byte array.
    Return a human-readable string summary of this sketch
    toString(boolean detail)
    Return a human-readable string summary of this sketch
    static String
    toString(byte[] byteArr, boolean detail)
    Returns a human readable string of the preamble of a byte array image of a CpcSketch.
    static String
    toString(org.apache.datasketches.memory.Memory mem, boolean detail)
    Returns a human readable string of the preamble of a Memory image of a CpcSketch.
    void
    update(byte[] data)
    Present the given byte array as a potential unique item.
    void
    update(char[] data)
    Present the given char array as a potential unique item.
    void
    update(double datum)
    Present the given double (or float) datum as a potential unique item.
    void
    update(int[] data)
    Present the given integer array as a potential unique item.
    void
    update(long datum)
    Present the given long as a potential unique item.
    void
    update(long[] data)
    Present the given long array as a potential unique item.
    void
    update(String datum)
    Present the given String as a potential unique item.
    void
    Present the given ByteBuffer as a potential unique item If the ByteBuffer is null or empty no update attempt is made and the method returns
    boolean
    Convience function that this Sketch is valid.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • DEFAULT_LG_K

      public static final int DEFAULT_LG_K
      The default Log_base2 of K
      See Also:
  • Constructor Details

    • CpcSketch

      public CpcSketch()
      Constructor with default log_base2 of k
    • CpcSketch

      public CpcSketch(int lgK)
      Constructor with log_base2 of k.
      Parameters:
      lgK - the given log_base2 of k
    • CpcSketch

      public CpcSketch(int lgK, long seed)
      Constructor with log_base2 of k and seed.
      Parameters:
      lgK - the given log_base2 of k
      seed - the given seed
  • Method Details

    • getEstimate

      public double getEstimate()
      Returns the best estimate of the cardinality of the sketch.
      Returns:
      the best estimate of the cardinality of the sketch.
    • getFamily

      public static Family getFamily()
      Return the DataSketches identifier for this CPC family of sketches.
      Returns:
      the DataSketches identifier for this CPC family of sketches.
    • getLgK

      public int getLgK()
      Return the parameter LgK.
      Returns:
      the parameter LgK.
    • getLowerBound

      public double getLowerBound(int kappa)
      Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean.
      Parameters:
      kappa - the given number of standard deviations from the mean: 1, 2 or 3.
      Returns:
      the best estimate of the lower bound of the confidence interval given kappa.
    • getMaxSerializedBytes

      public static int getMaxSerializedBytes(int lgK)
      The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches.

      For small values of n the size can be much smaller.

      Parameters:
      lgK - the given value of lgK.
      Returns:
      the estimated maximum compressed serialized size of a sketch.
    • getUpperBound

      public double getUpperBound(int kappa)
      Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean.
      Parameters:
      kappa - the given number of standard deviations from the mean: 1, 2 or 3.
      Returns:
      the best estimate of the upper bound of the confidence interval given kappa.
    • heapify

      public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem)
      Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
      Parameters:
      mem - the given Memory
      Returns:
      the given Memory as a CpcSketch on the Java heap.
    • heapify

      public static CpcSketch heapify(byte[] byteArray)
      Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED.
      Parameters:
      byteArray - the given byte array
      Returns:
      the given byte array as a CpcSketch on the Java heap.
    • heapify

      public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem, long seed)
      Return the given Memory as a CpcSketch on the Java heap.
      Parameters:
      mem - the given Memory
      seed - the seed used to create the original sketch from which the Memory was derived.
      Returns:
      the given Memory as a CpcSketch on the Java heap.
    • heapify

      public static CpcSketch heapify(byte[] byteArray, long seed)
      Return the given byte array as a CpcSketch on the Java heap.
      Parameters:
      byteArray - the given byte array
      seed - the seed used to create the original sketch from which the byte array was derived.
      Returns:
      the given byte array as a CpcSketch on the Java heap.
    • isEmpty

      public boolean isEmpty()
      Return true if this sketch is empty
      Returns:
      true if this sketch is empty
    • reset

      public final void reset()
      Resets this sketch to empty but retains the original LgK and Seed.
    • toByteArray

      public byte[] toByteArray()
      Return this sketch as a compressed byte array.
      Returns:
      this sketch as a compressed byte array.
    • update

      public void update(long datum)
      Present the given long as a potential unique item.
      Parameters:
      datum - The given long datum.
    • update

      public void update(double datum)
      Present the given double (or float) datum as a potential unique item. The double will be converted to a long using Double.doubleToLongBits(datum), which normalizes all NaN values to a single NaN representation. Plus and minus zero will be normalized to plus zero. The special floating-point values NaN and +/- Infinity are treated as distinct.
      Parameters:
      datum - The given double datum.
    • update

      public void update(String datum)
      Present the given String as a potential unique item. The string is converted to a byte array using UTF8 encoding. If the string is null or empty no update attempt is made and the method returns.

      Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.

      Parameters:
      datum - The given String.
    • update

      public void update(byte[] data)
      Present the given byte array as a potential unique item. If the byte array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given byte array.
    • update

      public void update(ByteBuffer data)
      Present the given ByteBuffer as a potential unique item If the ByteBuffer is null or empty no update attempt is made and the method returns
      Parameters:
      data - The given ByteBuffer
    • update

      public void update(char[] data)
      Present the given char array as a potential unique item. If the char array is null or empty no update attempt is made and the method returns.

      Note: this will not produce the same output hash values as the update(String) method but will be a little faster as it avoids the complexity of the UTF8 encoding.

      Parameters:
      data - The given char array.
    • update

      public void update(int[] data)
      Present the given integer array as a potential unique item. If the integer array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given int array.
    • update

      public void update(long[] data)
      Present the given long array as a potential unique item. If the long array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given long array.
    • validate

      public boolean validate()
      Convience function that this Sketch is valid. This is a troubleshooting tool for sketches that have been heapified from serialized images.

      If you are starting with a serialized image as a byte array, first heapify the byte array to a sketch, which performs a number of checks. Then use this function as one additional check on the sketch.

      Returns:
      true if this sketch is validated.
    • toString

      public String toString()
      Return a human-readable string summary of this sketch
      Overrides:
      toString in class Object
    • toString

      public String toString(boolean detail)
      Return a human-readable string summary of this sketch
      Parameters:
      detail - include data detail
      Returns:
      a human-readable string summary of this sketch
    • toString

      public static String toString(byte[] byteArr, boolean detail)
      Returns a human readable string of the preamble of a byte array image of a CpcSketch.
      Parameters:
      byteArr - the given byte array
      detail - if true, a dump of the compressed window and surprising value streams will be included.
      Returns:
      a human readable string of the preamble of a byte array image of a CpcSketch.
    • toString

      public static String toString(org.apache.datasketches.memory.Memory mem, boolean detail)
      Returns a human readable string of the preamble of a Memory image of a CpcSketch.
      Parameters:
      mem - the given Memory
      detail - if true, a dump of the compressed window and surprising value streams will be included.
      Returns:
      a human readable string of the preamble of a Memory image of a CpcSketch.